# Improved Methods for Divisible Load Distribution on $k$-Dimensional Meshes Using Multi-Installment 

Yeim-Kuan Chang, Jia-Hwa Wu, Chi-Yeh Chen, and Chih-Ping Chu


#### Abstract

In divisible load distribution, the classic methods on linear arrays divide the computation and communication processes into multiple time intervals in a pipelined fashion. Li [21] has proposed a set of improved algorithms for linear arrays that can be generalized to $k$-dimensional meshes. In this paper, we first propose algorithm $M$ (multi-installment), which employs the multi-installment technique to improve the best algorithm $Q$ proposed by Li. Second, we propose algorithm $S$ (start-up cost), which includes the computation and communication start-up costs in the design. Although the asymptotic speed-ups of our algorithms $M$ and $S$ derived from the closed-form solutions are the same as that of algorithm $Q$, our algorithms approach the optimal speed-ups considerably faster than algorithm $Q$ as the number of processors increases. Finally, we combine algorithms $M$ and $S$ and propose algorithm $MS$. Although algorithm $MS$ has the same asymptotic performance as algorithms $Q$ and $S$, it achieves a better speed-up when the load to be processed is very large and the number of processors is fixed or when the load to be processed is fixed and the number of processors is small.


Index Terms: Divisible load theory, linear array, $k$-dimensional mesh, multi-installment.

## 1 Introduction

Efficient load scheduling on the resources of a parallel and distributed system or a multiprocessor system is highly desirable for data-intensive applications. Like other mathematical models, such as queuing theory and electric resistive circuit theory, divisible load theory (DLT) [5], [4], [25] provides a powerful tool for modeling data-intensive applications. A divisible load is a load that can be arbitrarily partitioned in a linear fashion and can be distributed to more than one processor to achieve a faster execution time. Each partitioned portion of the load (called a chunk) can be independently processed on any processor on the network.

DLT started with the architecture of a linear array of processors [11] in 1988. Since then, DLT has been widely studied in the literature. DLT gained much attention because of a landmark book written in 1996 [4] and two introductory surveys [5], [25]. The interconnection topologies, such as bus [7], [14], [18], [26], [28], linear array [6], [7], tree [2], [3], [10], [22], hypercube [7], [23], and mesh [7], [8], [13], [17], [21], have been investigated. Applications of divisible computations include linear algebra [9], [16], image processing [24], multimedia applications [1], database searching [7], [12], and Internet packet scheduling [18].

We focus on mesh networks in this paper. In [13] and [8], circuit-switching-based algorithms for 2D and 3D meshes with start-up costs were studied. In [17], another circuit-switching-based algorithm was proposed for 3D meshes with start-up costs and a multisweep distribution policy. However, circuit switching, in which communication times are independent of the distances among processors, is different from the store-and-forward routing used by us and most researchers in this field. In store-and-forward routing, the communication time of a transmission is linearly proportional to both the load size and the distance covered. In [20], Li proposed a divisible load algorithm based on store-and-forward routing for $k$-dimensional meshes. In [21], Li also proposed improved algorithms by employing pipelined communication. Neither [20] nor [21] considers the start-up costs.

One technique that can minimize the parallel computation time is the multiround policy, or multi-installment policy. In one-round algorithms, each processor receives one load share for computation. In multiround algorithms, however, at least one processor receives two or more load shares, and the load distribution exhibits some kind of repetition or periodicity [2]. Using multiple rounds improves the overlap of computation with communication and, thus, the overall performance. If the start-up cost is not considered, an infinite number of rounds would lead to an optimal schedule. Many multiround divisible load algorithms for chains, stars, and trees can be found in [2], [3], [12], [15], and [28].

In this paper, we develop a technique, also called multi-installment, to minimize the parallel processing time. However, the proposed multi-installment technique is different from the existing multiround or multi-installment methods. Our multi-installment algorithms are new designs improved over the divisible load algorithm proposed in [21], which is a multiround algorithm. Our multi-installment algorithms divide the load distribution process into two different kinds of intervals: regular intervals and installment intervals. The regular intervals are executed first and are followed by the installment intervals. Also, the message routing method is assumed to be store-and-forward routing. The contributions of this paper are summarized as follows:

- The computation and communication start-up costs are included in the analysis, which were not considered in [21] and [6].
- We derive the closed-form solutions for the parallel execution time and speed-up of a linear array and a $k$-dimensional mesh.
- We show that the proposed algorithms perform better than those in [21].

The rest of this paper is organized as follows: In Section 2, we present the network model of divisible load algorithms. In Section 3, we review the classic method proposed in [21] and derive its closed-form solutions of the parallel execution time and speed-up by including the start-up costs. In Section 4, we extend the classic algorithms with a multi-installment processing technique and propose a set of improved algorithms. In Section 5, we extend the proposed algorithms for linear arrays to $k$-dimensional meshes. In Section 6, we compare the performance of the classic method with that of the proposed algorithms. We conclude the paper in Section 7.


## 2 The Model

We consider a homogeneous mesh network for analysis. A $k$-dimensional mesh $\mathrm{Mesh}_{\mathrm{k}}$ consists of $N=N_{1} \times \ldots \times N_{k}$ processors, where $k$ is a positive integer that is greater than or equal to one. In $\mathrm{Mesh}_{\mathrm{k}}$, an interior processor $P_{i}$ for $i=$ $i_{1} \times \ldots \times i_{k}$ is connected to $2 k$ neighbors $P_{j}$, where

$$
\begin{aligned}
j & =i_{1} \pm 1 \times i_{2} \times \ldots \times i_{k}, i_{1} \times i_{2} \pm 1 \times \ldots \times i_{k}, \ldots, i_{1} \\
& \times i_{2} \times \ldots \times i_{k} \pm 1
\end{aligned}
$$

In this paper, we assume that the corner processor $P_{N}$ is the initial processor that transmits load fractions to other processors for processing. It is known that the overall speed-up can be improved by using an interior processor as the initial processor [21]. Since using an interior processor is a straightforward extension, the details are omitted in this paper. All the links have the same communication speed and bandwidth. All the processors have the same processing capability. Each interior processor $P_{i}$ has $2k$ separate ports to communicate with all its neighbors. In other words, processor $P_{i}$ can send/receive messages to/from all its neighbors simultaneously. A processor sends a load fraction to a neighbor and can then proceed with other computation and communication activities without waiting for the completion of the load transmission process. However, a processor can only start to perform the computation after the entire load fraction assigned to it is received from its predecessor. To enhance the system performance, our load distribution algorithms allow simultaneous transmission of all processors, rather than sequential load distribution. We list the notations and terminology used in this paper in Table 1.

TABLE 1
Notations and Terminology

| Notation | Meaning |
| :---: | :---: |
| $N$ | The number of processors in a multicomputer system. |
| $L$ | The total load to be processed. |
| $T_{c m}$ | The time to transmit a unit of load along a link. |
| $T_{c p}, T_{c p, k+1}$ | $T_{c p}$ and $T_{c p, k+1}$ are the times to process a unit of load on a processor and a $k$-dimensional mesh, respectively. |
| $\beta, \beta_{k}$ | The computation-to-communication ratio of a node in the system. This ratio is $\beta=T_{c p} / T_{c m}$ when a node is a single processor. This ratio is $\beta_{k}$ when a node is a $k$-dimensional mesh. |
| $\theta_{c m}$ | The communication start-up cost in terms of delay time. |
| $\theta_{c p}, \theta_{k}$ | $\theta_{c p}$ and $\theta_{k}$ are the computation start-up costs in terms of delay time when a node is a single processor and a $k$-dimensional mesh, respectively. (Refer to Table 3.) |
| $\tau_{j}$ | Interval $j$ or the time duration of interval $j$. |
| $x$ | A temporary variable used in each algorithm. The value of $x$ is computed based on $L$, and it is different in all the algorithms studied in this paper. |
| $m$ | The number of multi-installment intervals in the proposed algorithms $M$ and $MS$. |
| $k$ | The number of dimensions of the mesh. |
| $\rho, \Delta, \Delta_{k}, \rho_{k}$ | $\rho=\beta /(N-1)$, $\Delta=\left(\theta_{c p}-\theta_{c m}\right) / T_{c m}$, $\Delta_{k}=\frac{\theta_{k}-\theta_{c m}}{T_{c m}}$, and $\rho_{k}=\frac{\beta_{k}}{N_{k}-1}$. |
| $h(i)$ | $h(0)=x+\Delta$ and $h(i)=\rho \times h(i-1)+\Delta=\rho^{i} x+\Delta \sum_{j=0}^{i} \rho^{j}$ for $i=1$ to $m-1$, used in algorithm $MS$ running on a linear array. |
| $L_{j}^{A}$ | The amount of the load processed by processor $P_{j}$ for algorithm $A$. $\sum_{j=1}^{N} L_{j}^{A}=L$. |
| $T_{N}^{A, L}, T_{N, m}^{A, L}, S_{N}^{A, L}, S_{N, m}^{A, L}$ | The parallel execution time ($T_{N}^{A, L}$) and speed-up ($S_{N}^{A, L}=\left(L T_{c p}+\theta_{c p}\right) / T_{N}^{A, L}$) of algorithm $A$ for processing $L$ units of load in a system of $N$ processors. We use $T_{N}^{A}$ and $S_{N}^{A}$ when asymptotic performance is considered or when omitting $L$ or $m$ (number of multi-installments) does not incur confusion. For algorithms with multi-installment, we use $T_{N, m}^{A, L}$ and $S_{N, m}^{A, L}$. |
| $C_{i}^{N}$ | $C_{i}^{N}=N(N-1) \times \cdots \times(N-i+1) / i!$, which is the number of combinations of $i$ components selected from a set of $N$ components. |
| $\gg$ | $E \gg F$ means $E$ is much larger than $F$. |

## 3 The Existing Method on Linear Arrays

The classic divisible load distribution methods running on a linear array of $N$ processors divide the computation and communication processes into $N$ time intervals or stages in a pipelined fashion. We assume that the leftmost processor $P_{N}$ is the initial processor that commences the computation and communication. In the first interval, $P_{N}$ computes a load fraction and transmits another load fraction to processor $P_{N-1}$ simultaneously. In the second interval, both processors $P_{N}$ and $P_{N-1}$ compute a load fraction and transmit another load fraction to their successors (that is, $P_{N-1}$ and $P_{N-2}$) simultaneously. The same computation and communication processes repeat until $P_{1}$ receives a load fraction in the $(N-1)$th interval and processes the received load fraction in the last interval. Notice that $P_{1}$ receives a load fraction only once. A better load distribution method determines the sizes of the load fractions computed and transmitted by a processor in such a way that the accumulated length of all time intervals is minimized. We assume that a processor does not start its computation and communication processes until it receives the entire load fraction assigned to it. One design principle for a good load distribution method is as follows:

Load balance rule between computation and communication. All processors finish their computation and communication processes simultaneously at the end of each interval. The best load distribution method is one in which no computation or communication resource is idle in any interval.


Fig. 1. Load distribution diagram of algorithm $Q$.

Theoretically, an infinite number of processors could lead to optimal performance for the divisible load distribution when the computation and communication start-up costs ($\theta_{c p}$ and $\theta_{c m}$) are ignored. However, we cannot ignore the start-up costs in real-world cases. The performance may degrade as the number of processors exceeds a certain value. In this section, we analyze the algorithm $Q$ proposed in [21] by including the computation and communication start-up costs that were originally neglected. The analysis results are therefore the same as the ones in [21] when the start-up costs are set to zero.

Algorithm $Q$ on a linear array of $N$ processors divides the parallel execution process into $N$ time intervals. In each interval, some of the processors can compute and communicate a load fraction simultaneously. Allocating load to a processor is based on the load balance rule described above. Thus, in an interval, if the load allocated to a processor for computation is $w$, then the load allocated to the same processor for transmission is $\beta w$, where $\beta$ is the computation-to-communication ratio of a processor defined in Table 1. As a result, the computation and communication processes are finished simultaneously at the end of each interval. The load distribution diagram of algorithm $Q$ is illustrated in Fig. 1. The initial processor is $P_{N}$. In the first time interval $\tau_{1}$, $P_{N}$ processes $(1 / \beta+1)^{N-2} x / \beta$ units of load and transmits $(1 / \beta+1)^{N-2} x$ units of load to $P_{N-1}$ simultaneously. At the end of interval $\tau_{1}$, $P_{N-1}$ has completely received $(1 / \beta+1)^{N-2} x$ units of load for processing. $P_{N-1}$ splits the received load into two parts, $(1 / \beta+1)^{N-3} x / \beta$ and $(1 / \beta+1)^{N-3} x$, in interval $\tau_{2}$. The former remains in $P_{N-1}$ for processing and the latter is transmitted to processor $P_{N-2}$. Notice that the term $(1 / \beta+1)^{N-2} x$, which is the amount of load transmitted from the initial processor $P_{N}$, comes from an elaborate calculation such that all processors receive $x$ units of load for computation in the last interval. In general, in interval $\tau_{N-j+1}$, all processors $P_{k}$ for $k=N$ to $j$ process $(1 / \beta+1)^{j-2} x / \beta$ units of load and transmit $(1 / \beta+1)^{j-2} x$ units of load to $P_{k-1}$ simultaneously.
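
The schedule described above can be replayed with a short simulation. The following is a minimal sketch, not the authors' code: the function name `simulate_q` and the normalization $x=1$ are assumptions made for illustration. It accumulates the load each processor computes over the $N$ intervals so that the result can be compared with $(1/\beta+1)^{j-1}x$ for processor $P_j$.

```python
def simulate_q(n, beta, x=1.0):
    """Replay algorithm Q's intervals; return the load computed by P_1..P_n.

    Hypothetical helper for illustration: n is the number of processors,
    beta = T_cp / T_cm, and x is the load every processor computes in the
    last interval.
    """
    computed = [0.0] * (n + 1)  # computed[j] belongs to P_j; index 0 is unused
    for i in range(1, n):       # intervals tau_1 .. tau_{n-1}
        # Load forwarded by every active processor in interval i.
        sent = (1.0 / beta + 1.0) ** (n - i - 1) * x
        for j in range(n - i + 1, n + 1):  # active processors P_n .. P_{n-i+1}
            # Balanced split: (sent / beta) * T_cp equals sent * T_cm.
            computed[j] += sent / beta
    for j in range(1, n + 1):   # last interval tau_n: every processor computes x
        computed[j] += x
    return computed[1:]


loads = simulate_q(n=5, beta=2.0)
closed_form = [(1.0 / 2.0 + 1.0) ** (j - 1) for j in range(1, 6)]
print(loads)
print(closed_form)
```

The two printed lists should agree up to rounding, which illustrates that the per-interval split and the per-processor totals describe the same distribution.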

We have to recalculate the parallel execution time and the speed-up of algorithm $Q$ because the computation and communication start-up costs are to be included in the analysis. Since the load fraction processed by $P_{j}$ is $L_{j}^{Q}=(1 / \beta+1)^{j-1} x$ for all $1 \leq j \leq N$ and the total load processed by all processors is $L$ (that is, $\sum_{j=1}^{N} L_{j}^{Q}=L$), we obtain $x=\frac{L / \beta}{(1 / \beta+1)^{N}-1}$. In algorithm $Q$, a processor is not able to finish its computation and communication processes simultaneously at the end of an interval when the computation and communication start-up costs $\theta_{c p}$ and $\theta_{c m}$ are not the same. This violates the load balance rule because either the computation or the communication resource is idle for some time in each interval. The duration of an interval is the maximum of the computation time and the communication time in that interval. Specifically, since the active processors compute $(1+1 / \beta)^{N-j-1} x / \beta$ units of load in interval $\tau_{j}$, we have $\tau_{j}=(1+1 / \beta)^{N-j-1}(x / \beta) T_{c p}+\max \left\{\theta_{c p}, \theta_{c m}\right\}$ for $j=1$ to $N-1$ and $\tau_{N}=x T_{c p}+\theta_{c p}$, since every processor computes $x$ units of load and no communication start-up cost is involved in the last interval. Summing all the intervals gives $T_{N}^{Q, L}=L_{N}^{Q} T_{c p}+(N-1) \max \left\{\theta_{c p}, \theta_{c m}\right\}+\theta_{c p}$, as shown in Table 2. We give the following result without a proof because the proof is straightforward:

Theorem 1. The speed-up of algorithm $Q$ is $S_{N}^{Q, L}=\frac{L T_{c p}+\theta_{c p}}{T_{N}^{Q, L}}$ and the asymptotic speed-up is $S_{N}^{Q}=1+\beta$ when $N$ and $L$ tend to infinity and $L \gg N$.
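
As an illustration of these closed forms, the sketch below evaluates $x$, $L_j^Q$, $T_N^{Q,L}$, and $S_N^{Q,L}$ numerically. The function name `q_closed_form`, its parameter names, and the sample values are assumptions chosen for this example; only the formulas come from the text.

```python
def q_closed_form(n, load, t_cp, t_cm, th_cp=0.0, th_cm=0.0):
    """Evaluate algorithm Q's closed forms (hypothetical helper).

    Returns (x, loads, t_par, speedup), where loads[j-1] is L_j^Q.
    """
    beta = t_cp / t_cm
    r = 1.0 + 1.0 / beta
    x = (load / beta) / (r ** n - 1.0)
    loads = [r ** (j - 1) * x for j in range(1, n + 1)]
    # P_N is busy for the whole run, so its load fixes the execution time.
    t_par = loads[-1] * t_cp + (n - 1) * max(th_cp, th_cm) + th_cp
    speedup = (load * t_cp + th_cp) / t_par
    return x, loads, t_par, speedup


# Without start-up costs the speed-up grows toward 1 + beta as N grows;
# with start-up costs it eventually degrades.
for n in (2, 10, 100):
    no_cost = q_closed_form(n, 1000.0, 2.0, 1.0)[3]
    with_cost = q_closed_form(n, 1000.0, 2.0, 1.0, 1.0, 1.0)[3]
    print(n, no_cost, with_cost)
```

With $\beta=2$ the cost-free speed-up approaches $3$, while the run with unit start-up costs shows the degradation for large $N$ mentioned at the beginning of this section.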

## 4 The Proposed Algorithms

In this section, we first propose an improved algorithm $M$ that uses the multi-installment processing technique and compare it with the original algorithm $Q$ proposed in [21]. Neither algorithm $Q$ nor algorithm $M$ considers start-up costs. Second, we consider the computation and communication start-up costs and propose an improved algorithm $S$ over algorithm $Q$. Third, we combine algorithms $M$ and $S$ and develop the best algorithm $MS$. We first summarize the performance results in Table 2 to give readers an easy reference for the straightforward but tedious equation derivations needed in the proposed algorithms.

TABLE 2
Performance Summary

| Algorithm | Parallel execution time | Speedup |
| :---: | :---: | :---: |
| $Q$ | $T_{N}^{Q, L}=\frac{L T_{c m}(1+1 / \beta)^{N-1}}{(1+1 / \beta)^{N}-1}+(N-1) \max \left\{\theta_{c p}, \theta_{c m}\right\}+\theta_{c p}$ | $S_{N}^{Q, L}=\frac{L T_{c p}+\theta_{c p}}{T_{N}^{Q, L}}$ |
| $M$ | $T_{N, m}^{M, L}=\frac{L T_{c m}\left((1+1 / \beta)^{N-1}+\sum_{i=1}^{m-1} \rho^{i}\right)}{(1+1 / \beta)^{N}+\frac{N}{\beta} \sum_{i=1}^{m-1} \rho^{i}-1}$ | $S_{N, m}^{M, L}=\frac{L T_{c p}}{T_{N, m}^{M, L}}$ |
| $S$ | $T_{N}^{S, L}=\frac{T_{c m}(L-N \Delta)(1+1 / \beta)^{N-1}}{(1+1 / \beta)^{N}-1}+\Delta T_{c p}+N \theta_{c p}$ | $S_{N}^{S, L}=\frac{L T_{c p}+\theta_{c p}}{T_{N}^{S, L}}$ |
| $MS$ | $T_{N, m}^{M S, L}=\frac{T_{c m}\left(L-N \Delta \sum_{k=0}^{m-1} \sum_{i=0}^{k} \rho^{i}\right)\left((1+1 / \beta)^{N-1}+\sum_{k=1}^{m-1} \rho^{k}\right)}{(1+1 / \beta)^{N}+\frac{N}{\beta} \sum_{k=1}^{m-1} \rho^{k}-1}+\Delta T_{c p} \sum_{k=0}^{m-1} \sum_{i=0}^{k} \rho^{i}+((N-1) m+1) \theta_{c p}$ | $S_{N, m}^{M S, L}=\frac{L T_{c p}+\theta_{c p}}{T_{N, m}^{M S, L}}$ |

Fig. 2. Load distribution diagram of algorithm $M$.

### 4.1 Algorithm $M$ with Multiple Installments

By carefully inspecting the load distribution in Fig. 1, it can be seen that the waiting time for a processor to receive a load fraction to compute is the main delay that limits the performance of algorithm $Q$. This waiting time is proportional to $x$. A larger $x$ results in a longer waiting time for a processor to receive a fraction of load to compute. Also, the last processor $P_{1}$ in the linear array receives a fraction of load for computation only once. The longer $P_{1}$ waits for a fraction of load to compute, the longer the total parallel execution time is. Therefore, we propose an improved algorithm $M$ that uses the multi-installment technique to shrink $x$. By multiple installments, we mean that all of the processors (including $P_{1}$) receive a fraction of load multiple times. The load distribution diagram of algorithm $M$ is shown in Fig. 2. The original interval $\tau_{N}$ is replaced with $m$ intervals called multi-installment intervals. The load distribution in the first $N-1$ intervals is exactly the same as that in algorithm $Q$. Each multi-installment interval except the last one ($\tau_{N}$ to $\tau_{N+m-2}$) is divided into $N-1$ subintervals. In each subinterval of $\tau_{N}$, every processor computes $x /(N-1)$ units of load at the same time. In the first subinterval of $\tau_{N}$, $P_{N}$ transmits $x \beta /(N-1)=\rho x$ units of load to its successor ($P_{N-1}$). In the second subinterval of $\tau_{N}$, $P_{N}$ and $P_{N-1}$ transmit $\rho x$ units of load to their successors. Generally, in the $k$th subinterval of $\tau_{N}$, processors $P_{i}$ for $i=N$ to $N-k+1$ transmit $\rho x$ units of load to their successors. The other multi-installment intervals $\tau_{N+1}, \ldots, \tau_{N+m-2}$ are divided into $N-1$ subintervals in the same way as $\tau_{N}$. In the last interval $\tau_{N+m-1}$, every processor computes $\rho^{m-1} x$ units of load. It can be easily shown that the load balance rule is fulfilled, that is, the computation process and the communication process finish simultaneously at the end of each interval $\tau_{N+i}$ for $i=0$ to $m-2$.

We give the complete pseudocode of algorithm $M$ in Fig. 3. Since the load fraction processed by $P_{j}$ is $L_{j}^{M}=\left((1 / \beta+1)^{j-1}+\sum_{i=1}^{m-1} \rho^{i}\right) x$ for all $1 \leq j \leq N$ and the total load processed by all processors is $L$ (that is, $\sum_{j=1}^{N} L_{j}^{M}=L$), we have

$$
x=\frac{L / \beta}{(1+1 / \beta)^{N}+\frac{N}{\beta} \sum_{i=1}^{m-1} \rho^{i}-1} .
$$

Consequently, the parallel execution time is $T_{N, m}^{M, L}=T_{c p} L_{N}^{M}$, and $T_{N, m}^{M, L}$ is linearly proportional to $L$.
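
The closed form for algorithm $M$ can be evaluated numerically as follows. This is a sketch under stated assumptions: the function name `m_closed_form` and the sample parameters are hypothetical, start-up costs are taken as zero as in the text, and $N \geq 2$ is required so that $\rho=\beta/(N-1)$ is defined.

```python
def m_closed_form(n, m, load, t_cp, t_cm):
    """Evaluate algorithm M's closed forms (hypothetical helper, n >= 2).

    Returns (x, loads, t_par, speedup), where loads[j-1] is L_j^M.
    """
    beta = t_cp / t_cm
    rho = beta / (n - 1)
    r = 1.0 + 1.0 / beta
    geo = sum(rho ** i for i in range(1, m))  # sum_{i=1}^{m-1} rho^i
    x = (load / beta) / (r ** n + (n / beta) * geo - 1.0)
    loads = [(r ** (j - 1) + geo) * x for j in range(1, n + 1)]
    t_par = t_cp * loads[-1]
    return x, loads, t_par, load * t_cp / t_par


# m = 1 reproduces algorithm Q without start-up costs; larger m shrinks x.
for m in (1, 2, 4, 8):
    x, _, t_par, speedup = m_closed_form(4, m, 100.0, 2.0, 1.0)
    print(m, x, t_par, speedup)
```

For the sample values ($N=4$, $\beta=2$, so $\rho=2/3$), the printed $x$ decreases and the speed-up increases with $m$, which is the effect the multi-installment intervals are designed to produce.
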

Theorem 2. The speed-up of algorithm $M$ is $S_{N, m}^{M}=\frac{T_{c p} L}{T_{N, m}^{M}}$ and its asymptotic speed-up is 1) $S_{N, m}^{M}=\beta+1$ if $N=\infty$, 2) $S_{N, m}^{M}=\beta+1$ if $m=\infty$ and $0<\rho<1$, or 3) $S_{N, m}^{M}=N$ if $m=\infty$ and $\rho \geq 1$.

Proof. Refer to Table 2 for $S_{N, m}^{M}$. If $N=\infty$, $(1+1 / \beta)^{N}$ and $(1+1 / \beta)^{N-1}$ are much larger than $\frac{N}{\beta} \sum_{i=1}^{m-1} \rho^{i}$ and $\sum_{i=1}^{m-1} \rho^{i}$, respectively.

$$
\begin{aligned}
& \textbf{Algorithm M} \\
& \text{for } (i=1;\ i<N+m;\ i\text{++}) \text{ do} \quad \text{// time interval } \tau_{i} \\
& \quad \text{if } (i<N) \\
& \quad\quad \text{for } (j=N-i;\ j \leq N;\ j\text{++}) \text{ do in parallel} \\
& \quad\quad\quad \text{if } (j>N-i) \\
& \quad\quad\quad\quad \text{send } (1 / \beta+1)^{N-i-1} x \text{ amount of load to } P_{j-1}; \\
& \quad\quad\quad\quad \text{receive } (1 / \beta+1)^{N-i-1} x \text{ amount of load from } P_{j+1} \text{ if } j<N; \\
& \quad\quad\quad\quad \text{compute } (1 / \beta+1)^{N-i-1} x / \beta \text{ amount of load;} \\
& \quad\quad\quad \text{else} \\
& \quad\quad\quad\quad \text{receive } (1 / \beta+1)^{N-i-1} x \text{ amount of load from } P_{N-i+1} \text{ if } j<N; \\
& \quad\quad\quad \text{end if} \\
& \quad\quad \text{end do} \\
& \quad \text{else if } (i<N+m-1) \\
& \quad\quad \text{for } (k=1;\ k<N;\ k\text{++}) \text{ do} \\
& \quad\quad\quad \text{for } (j=1;\ j \leq N;\ j\text{++}) \text{ do in parallel} \\
& \quad\quad\quad\quad \text{if } (j>N-k) \\
& \quad\quad\quad\quad\quad \text{send } \rho^{i-N+1} x \text{ amount of load to } P_{j-1}; \\
& \quad\quad\quad\quad\quad \text{receive } \rho^{i-N+1} x \text{ amount of load from } P_{j+1} \text{ if } j<N; \\
& \quad\quad\quad\quad\quad \text{compute } \rho^{i-N+1} x / \beta \text{ amount of load;} \\
& \quad\quad\quad\quad \text{else if } (j=N-k) \\
& \quad\quad\quad\quad\quad \text{receive } \rho^{i-N+1} x \text{ amount of load from } P_{N-k+1}; \\
& \quad\quad\quad\quad\quad \text{compute } \rho^{i-N+1} x / \beta \text{ amount of load;} \\
& \quad\quad\quad\quad \text{else} \\
& \quad\quad\quad\quad\quad \text{compute } \rho^{i-N+1} x / \beta \text{ amount of load;} \\
& \quad\quad\quad\quad \text{end if} \\
& \quad\quad\quad \text{end do} \\
& \quad\quad \text{end do} \\
& \quad \text{else} \\
& \quad\quad \text{for } (j=1;\ j \leq N;\ j\text{++}) \text{ do in parallel} \\
& \quad\quad\quad \text{compute } \rho^{i-N} x \text{ amount of load;} \\
& \quad\quad \text{end do} \\
& \quad \text{end if} \\
& \text{end do}
\end{aligned}
$$

Fig. 3. Algorithm M.

Fig. 4. Algorithm $S$. (a) Load distribution diagram of algorithm $S$. (b) The pseudocode of algorithm $S$.

Thus, $S_{N, m}^{M}=\beta+1$. If $m=\infty$ and $0<\rho<1(N \neq \infty)$, we have $\sum_{i=1}^{m-1} \rho^{i}=\frac{\rho}{1-\rho}=\frac{\beta}{N-1-\beta}$ and, thus,

$$
S_{N, m}^{M}=\frac{(1+1 / \beta)^{N}+\frac{N}{\beta} \frac{\beta}{N-1-\beta}-1}{\frac{1}{\beta}\left((1 / \beta+1)^{N-1}+\frac{\beta}{N-1-\beta}\right)},
$$

which can be simplified as $S_{N, m}^{M}=\beta+1$. However, if $m=\infty$ and $\rho \geq 1$, $\sum_{i=1}^{m-1} \rho^{i}$ is much larger than $(1+1 / \beta)^{N}$ and $(1+1 / \beta)^{N-1}$. Thus, $S_{N, m}^{M}=N$.

Theorem 3. For algorithms $Q$ and $M$ without start-up costs, $T_{N}^{Q} \geq T_{N, m}^{M}$ or $S_{N}^{Q} \leq S_{N, m}^{M}$ for $N \geq 1$.
Proof. We prove that

$$
\frac{T_{N}^{Q}-T_{N, m}^{M}}{L T_{c m}}=\frac{(1+1 / \beta)^{N-1}}{(1+1 / \beta)^{N}-1}-\frac{(1+1 / \beta)^{N-1}+\sum_{i=1}^{m-1} \rho^{i}}{(1+1 / \beta)^{N}-1+\frac{N}{\beta} \sum_{i=1}^{m-1} \rho^{i}} \geq 0
$$

which can be simplified as $f(N, \beta) \geq 0$, where $f(N, \beta)=\frac{N}{\beta}(1 / \beta+1)^{N-1}-(1 / \beta+1)^{N}+1$. By expanding $(1+1 / \beta)^{N-1}$ into $1+\sum_{i=1}^{N-1} \frac{C_{i}^{N-1}}{\beta^{i}}$ and $(1+1 / \beta)^{N}$ into $1+\frac{N}{\beta}+\sum_{i=2}^{N} \frac{C_{i}^{N}}{\beta^{i}}$, we obtain

$$
f(N, \beta)=\frac{N}{\beta} \sum_{i=1}^{N-1} \frac{C_{i}^{N-1}}{\beta^{i}}-\sum_{i=2}^{N} \frac{C_{i}^{N}}{\beta^{i}}=\sum_{i=2}^{N}(i-1) \frac{C_{i}^{N}}{\beta^{i}} .
$$

Since $(i-1) \frac{C_{i}^{N}}{\beta^{i}} \geq 0$ for $i=2$ to $N$, we must have $f(N, \beta) \geq 0$. Thus, the theorem follows.
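
The binomial identity used in this proof can be spot-checked numerically. The sketch below is only an illustration: the names `f_direct` and `f_binomial` are hypothetical, and the check covers a handful of sample values rather than replacing the proof.

```python
from math import comb, isclose


def f_direct(n, beta):
    """f(N, beta) = (N/beta)(1/beta + 1)^(N-1) - (1/beta + 1)^N + 1."""
    r = 1.0 + 1.0 / beta
    return (n / beta) * r ** (n - 1) - r ** n + 1.0


def f_binomial(n, beta):
    """Expanded form: sum_{i=2}^{N} (i - 1) * C(N, i) / beta^i."""
    return sum((i - 1) * comb(n, i) / beta ** i for i in range(2, n + 1))


for n in (1, 2, 5, 12):
    for beta in (0.5, 2.0, 10.0):
        direct, expanded = f_direct(n, beta), f_binomial(n, beta)
        # Both forms should agree and be nonnegative (up to rounding).
        print(n, beta, direct, expanded,
              isclose(direct, expanded, rel_tol=1e-9, abs_tol=1e-9))
```

Every term of the expanded sum is nonnegative, so the printed values are nonnegative as well, with $f(1, \beta)=0$ as the boundary case.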

### 4.2 Algorithm $S$ with Start-Up Costs

To avoid either computation or communication resource lying idle in a time interval, we propose an improved algorithm $S$ over algorithm $Q$ to distribute the divisible load in a linear array of processors with start-up costs.

Consider applying algorithm $Q$ in the situation where $\theta_{c p}>\theta_{c m}$. A processor must finish its communication process before the computation process. Thus, as the load balance rule states, we have to balance the time delays between the computation and communication processes. Let $w_{c p}$ and $w_{c m}$ be the loads assigned to processor $P_{N}$ for processing and transmitting in a certain interval, respectively. To make the processes of computing load $w_{c p}$ and transmitting load $w_{c m}$ finish at the same time, we have the equation $w_{c p} T_{c p}+\theta_{c p}=w_{c m} T_{c m}+\theta_{c m}$, which yields $w_{c m}=\beta w_{c p}+\left(\theta_{c p}-\theta_{c m}\right) / T_{c m}$. Therefore, one way (called the addition method) to achieve the load balance between the computation and communication processes is to add an additional amount of load $\Delta=\left(\theta_{c p}-\theta_{c m}\right) / T_{c m}$ to the communication process, as illustrated in the load distribution diagram in Fig. 4a. Specifically, we add an additional load $\Delta$ to the communication process in all the intervals except the final interval. Also, each processor gets an extra computation load $\Delta$ in its final interval. Another way, called the subtraction method, is to subtract the $\Delta$ amount of load from the load assigned to the computation process. The parallel execution time of the subtraction method is the same as that of the addition method. The reason is that the variable $x$ is able to adapt to these two methods. Consequently, both methods work when $\Delta$ is negative, that is, when $\theta_{c p}<\theta_{c m}$. We use the addition method in this paper. We give the complete pseudocode of algorithm $S$ in Fig. 4b.

Fig. 5. Load distribution diagram of algorithm $MS$.

Since the load processed by processor $P_{j}$ is $L_{j}^{S}=(1+1 / \beta)^{j-1} x+\Delta$ for all $1 \leq j \leq N$ and $\sum_{j=1}^{N} L_{j}^{S}=L$, we obtain

$$
x=\frac{L-N \Delta}{\beta(1+1 / \beta)^{N}-\beta}
$$

and the parallel execution time $T_{N}^{S, L}=T_{c p} L_{N}^{S}+N \theta_{c p}$ (see Table 2). The number of processors $N$ is bounded above by the inequality $L \geq N \Delta$ because, otherwise, variable $x$ becomes negative.
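
The addition method and the resulting closed form can be illustrated with the sketch below. The function names `balanced_transmit_load` and `s_closed_form`, the parameter names, and the sample values are assumptions made for this example; the formulas are the ones given above.

```python
def balanced_transmit_load(w_cp, t_cp, t_cm, th_cp, th_cm):
    """Load to transmit so that computing w_cp and transmitting finish together.

    Solves w_cp*T_cp + theta_cp = w_cm*T_cm + theta_cm for w_cm.
    """
    return (t_cp / t_cm) * w_cp + (th_cp - th_cm) / t_cm


def s_closed_form(n, load, t_cp, t_cm, th_cp, th_cm):
    """Evaluate algorithm S's closed forms (hypothetical helper).

    Returns (x, loads, t_par, speedup), where loads[j-1] is L_j^S.
    """
    beta = t_cp / t_cm
    delta = (th_cp - th_cm) / t_cm
    if load < n * delta:
        raise ValueError("L < N*Delta: x would be negative")
    r = 1.0 + 1.0 / beta
    x = (load - n * delta) / (beta * r ** n - beta)
    loads = [r ** (j - 1) * x + delta for j in range(1, n + 1)]
    t_par = t_cp * loads[-1] + n * th_cp
    return x, loads, t_par, (load * t_cp + th_cp) / t_par


w_cm = balanced_transmit_load(7.0, 2.0, 1.0, 3.0, 1.0)
print(w_cm, 7.0 * 2.0 + 3.0, w_cm * 1.0 + 1.0)  # both finishing times coincide
print(s_closed_form(5, 1000.0, 2.0, 1.0, 3.0, 1.0)[2:])
```

The first print shows that the computation and the communication finish at the same time once $\Delta$ is added to the transmitted load; the guard in `s_closed_form` mirrors the bound $L \geq N \Delta$.
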

Theorem 4. The speed-up of algorithm $S$ is $S_{N}^{S, L}=\frac{L T_{c p}+\theta_{c p}}{T_{N}^{S, L}}$ and the asymptotic speed-up is $S_{N}^{S}=1+\beta$ when $N$ and $L$ tend to infinity and $L \gg N$.

Proof. The proof is straightforward.

Theorem 5. $S_{N}^{S, L} \geq S_{N}^{Q, L}$ and $S_{N}^{Q, L} \cong S_{N}^{S, L}$ if $L \gg N$.

Proof. $S_{N}^{S, L} \geq S_{N}^{Q, L}$ means that $T_{N}^{S, L} \leq T_{N}^{Q, L}$. Therefore, we have to prove that

$$
\begin{aligned}
& \frac{T_{c m}(L-N \Delta)(1+1 / \beta)^{N-1}}{(1+1 / \beta)^{N}-1}+\Delta T_{c p}+N \theta_{c p} \\
& \leq \frac{T_{c m} L(1+1 / \beta)^{N-1}}{(1+1 / \beta)^{N}-1}+(N-1) \max \left\{\theta_{c p}, \theta_{c m}\right\}+\theta_{c p}
\end{aligned}
$$

Since $N \theta_{c p}$ must be smaller than or equal to $(N-1) \max \left\{\theta_{c p}, \theta_{c m}\right\}+\theta_{c p}$, we only need to prove that $\Delta \beta-N \Delta \frac{(1+1 / \beta)^{N-1}}{(1+1 / \beta)^{N}-1} \leq 0$ or $N \frac{(1+1 / \beta)^{N-1}}{(1+1 / \beta)^{N}-1} \geq \beta$. By expanding $(1+1 / \beta)^{N-1}$ and $(1+1 / \beta)^{N}$, we have

$$
N \frac{(1+1 / \beta)^{N-1}}{(1+1 / \beta)^{N}-1}=\beta \frac{\sum_{i=0}^{N-1} \frac{N C_{i}^{N-1}}{\beta^{i}}}{\sum_{i=1}^{N} \frac{C_{i}^{N}}{\beta^{i-1}}},
$$

which is larger than or equal to $\beta$ because the $\frac{N C_{j}^{N-1}}{\beta^{j}}$ of the numerator must be larger than or equal to $\frac{C_{j+1}^{N}}{\beta^{j}}$ of the denominator for $j=0$ to $N-1$. Therefore, $T_{N}^{S, L} \leq T_{N}^{Q, L}$. Similarly, if $L \gg N, T_{N}^{Q, L} \cong T_{N}^{S, L}$. Thus, the theorem follows.
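
A numerical spot check of Theorem 5 is sketched below for the case $\theta_{c p} \geq \theta_{c m}$ considered in this section. The function names `growth`, `time_q`, and `time_s` and the sample parameters are hypothetical; the execution times are the Table 2 expressions for algorithms $Q$ and $S$.

```python
def growth(n, beta):
    """N (1+1/beta)^(N-1) / ((1+1/beta)^N - 1), the quantity shown to be >= beta."""
    r = 1.0 + 1.0 / beta
    return n * r ** (n - 1) / (r ** n - 1.0)


def time_q(n, load, t_cp, t_cm, th_cp, th_cm):
    """Parallel execution time of algorithm Q with start-up costs (Table 2)."""
    r = 1.0 + t_cm / t_cp
    return (load * t_cm * r ** (n - 1) / (r ** n - 1.0)
            + (n - 1) * max(th_cp, th_cm) + th_cp)


def time_s(n, load, t_cp, t_cm, th_cp, th_cm):
    """Parallel execution time of algorithm S (Table 2); requires L >= N*Delta."""
    r = 1.0 + t_cm / t_cp
    delta = (th_cp - th_cm) / t_cm
    return (t_cm * (load - n * delta) * r ** (n - 1) / (r ** n - 1.0)
            + delta * t_cp + n * th_cp)


for n in (2, 4, 8, 16):
    args = (n, 1000.0, 2.0, 1.0, 3.0, 1.0)  # L, T_cp, T_cm, theta_cp, theta_cm
    print(n, growth(n, 2.0), time_q(*args), time_s(*args))
```

For these sample values `growth(n, 2.0)` stays at or above $\beta=2$ and `time_s` does not exceed `time_q`, in line with the theorem; the gap is small relative to the total time when $L \gg N$.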

### 4.3 Algorithm MS with Start-Up Costs and Multiple Installments

In this section, we propose an algorithm $M S$ that is better than algorithm $S$ by utilizing the multi-installment technique as in algorithm $M$. The load distribution diagram is illustrated in Fig. 5. The complete pseudocode of algorithm $M S$ is also given in Fig. 6. As in algorithm $M$, the original interval $\tau_{N}$ in algorithm $M S$ is replaced with $m$ multi-installment intervals, and each multi-installment interval is divided into $N-1$ subintervals. In general, in the $k$ th subinterval of $\tau_{N+j-1}$ for $j=1$ to $m-1$, all processors $P_{i}$ for $i=N$ to $N-k+1$ compute $h(j-1)$ units of load and transmit $h(j)$ units of load to their successors, where $h(j)$ is defined in Table 1.

```
Algorithm MS
for (i = 1; i < N+m; i++) do  // time interval \(\tau_{i}\)
    if (i < N)
        the same as lines 03-11 o…
    else if (i < N+m-1)
        for (k = 1; k < N; k++) do
            for (j = 1; j <= N; j++) do in parallel
                if (j > N-k)
                    send \(h(i-N)\) amount of load to \(P_{j-1}\);
                    receive \(h(i-N)\) amount of load from \(P_{j+1}\) if j < N;
                    compute \(h(i-N) / \beta\) amount of load;
                else if (j = N-k)
                    receive \(h(i-N)\) amount of load from \(P_{N-k+1}\);
                    compute \(h(i-N) / \beta\) amount of load;
                else
                    compute \(h(i-N) / \beta\) amount of load;
                end if
            end do
        end do
    else
        for (j = 1; j <= N; j++) do in parallel
            compute \(h(m-1)\) amount of load;
        end do
    end if
end do
```

Fig. 6. Algorithm MS.

With $h(i)=\rho \times h(i-1)+\Delta$, it can be easily shown that the load balance rule between the computation and communication processes is satisfied. Since the load processed by processor $P_{j}$ is

$$
\begin{aligned}
L_{j}^{M S} & =x(1 / \beta+1)^{j-1}-x+\sum_{i=0}^{m-1} h(i) \\
& =x(1 / \beta+1)^{j-1}+x \sum_{k=1}^{m-1} \rho^{k}+\Delta \sum_{k=0}^{m-1} \sum_{i=0}^{k} \rho^{i}
\end{aligned}
$$

for all $1 \leq j \leq N$ and $\sum_{j=1}^{N} L_{j}^{M S}=L$, we have

$$
x=\frac{\frac{1}{\beta}\left(L-N \Delta \sum_{k=0}^{m-1} \sum_{i=0}^{k} \rho^{i}\right)}{(1+1 / \beta)^{N}+\frac{N}{\beta} \sum_{k=1}^{m-1} \rho^{k}-1} .
$$

The number of processors $N$ is bounded above by the inequality $L \geq N \Delta \sum_{k=0}^{m-1} \sum_{i=0}^{k} \rho^{i}$ because, otherwise, variable $x$ becomes negative. Also, the parallel execution time to process $L$ units of load is $T_{N, m}^{M S, L}=T_{c p} L_{N}^{M S}+(N-1) m \theta_{c p}+\theta_{c p}$ (see Table 2). Notice that algorithm $S$ is the same as algorithm $MS$ when the number of installments $m=1$, and algorithm $M$ is the same as algorithm $MS$ when $\Delta=0$. We summarize our results in the following theorems:
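
The closed forms for algorithm $MS$ can be evaluated with the sketch below. The function name `ms_closed_form`, its parameter names, and the sample values are assumptions made for illustration, and $N \geq 2$ is required so that $\rho$ is defined. The sketch builds $h(i)$ from its recurrence, derives $L_j^{MS}$ from it, and then compares $m=1$ (algorithm $S$) with $m=3$ in two regimes.

```python
def ms_closed_form(n, m, load, t_cp, t_cm, th_cp, th_cm):
    """Evaluate algorithm MS's closed forms (hypothetical helper, n >= 2).

    Returns (x, h, loads, t_par, speedup); h[i] is h(i), loads[j-1] is L_j^MS.
    """
    beta = t_cp / t_cm
    rho = beta / (n - 1)
    delta = (th_cp - th_cm) / t_cm
    r = 1.0 + 1.0 / beta
    geo = sum(rho ** k for k in range(1, m))                             # sum_{k=1}^{m-1} rho^k
    dbl = sum(sum(rho ** i for i in range(k + 1)) for k in range(m))     # double sum
    if load < n * delta * dbl:
        raise ValueError("load too small: x would be negative")
    x = (load - n * delta * dbl) / beta / (r ** n + (n / beta) * geo - 1.0)
    h = [x + delta]                       # h(0) = x + Delta
    for _ in range(1, m):
        h.append(rho * h[-1] + delta)     # h(i) = rho*h(i-1) + Delta
    loads = [x * r ** (j - 1) - x + sum(h) for j in range(1, n + 1)]
    t_par = t_cp * loads[-1] + (n - 1) * m * th_cp + th_cp
    return x, h, loads, t_par, (load * t_cp + th_cp) / t_par


params = (2.0, 1.0, 3.0, 1.0)  # T_cp, T_cm, theta_cp, theta_cm
# Large load, few processors: compare m = 1 (algorithm S) with m = 3.
print(ms_closed_form(4, 1, 1e6, *params)[4], ms_closed_form(4, 3, 1e6, *params)[4])
# Fixed load, many processors: the extra start-ups of m = 3 now cost more.
print(ms_closed_form(50, 1, 1000.0, *params)[4], ms_closed_form(50, 3, 1000.0, *params)[4])
```

For these sample parameters the first line shows a higher speed-up for $m=3$ and the second a higher speed-up for $m=1$.
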
Theorem 6. The speed-up of algorithm $M S$ is $S_{N, m}^{M S, L}=\frac{L T_{c p}+\theta_{c p}}{T_{N, m}^{M S, L}}$, and the asymptotic speed-up is $S_{N}^{M S, L}=1+\beta$ when $N$ and $L$ tend to infinity and $L \gg N$.

Proof. When $N$ tends to infinity, we obtain $\sum_{k=0}^{m-1} \sum_{i=0}^{k} \rho^{i} \approx m$, $\frac{N}{\beta} \sum_{k=1}^{m-1} \rho^{k} \approx 1$, and $\sum_{k=1}^{m-1} \rho^{k} \approx 0$ since $\rho=\frac{\beta}{N-1}$. Thus, the theorem follows.

Theorem 7. $S_{N}^{S, L} \geq S_{N, m}^{M S, L}$ when $L$ is constant and $N$ tends to infinity.

Proof. If $N$ tends to infinity, $\sum_{k=0}^{m-1} \sum_{i=0}^{k} \rho^{i} \approx m$, $\frac{N}{\beta} \sum_{k=1}^{m-1} \rho^{k} \approx 1$, and $\sum_{k=1}^{m-1} \rho^{k} \approx 0$. Thus, we have $T_{N, m}^{M S, L} \cong \frac{L T_{c p}}{1+\beta}+\frac{m N}{1+\beta}\left(\beta \theta_{c m}+\theta_{c p}\right)$. Since $\beta \theta_{c m}+\theta_{c p} \geq 0$, the larger $m$ is, the longer $T_{N, m}^{M S, L}$ is. As a result, $T_{N, 1}^{M S, L}=T_{N}^{S, L}$ is shorter than $T_{N, m}^{M S, L}$ for any $m \geq 2$. Thus, the theorem follows.

Theorem 8. $S_{N, m}^{M S, L} \geq S_{N}^{S, L}$ if $N$ is a constant and $L$ tends to infinity.

Proof. It is sufficient to prove that $T_{N}^{S, L}-T_{N, m}^{M S, L} \geq 0$. Let $\mathrm{EQ}=T_{N}^{S, L}-T_{N, m}^{M S, L}$. Based on whether or not the terms in EQ depend on $L$ (refer to Table 2), EQ can be simplified as

$$
\begin{aligned}
& L T_{c m} S\left((1+1 / \beta)^{N-1}(N-1-\beta) / \beta+1\right) \\
& +g\left(N, T_{c p}, T_{c m}, \beta, \theta_{c p}, \theta_{c m}\right)
\end{aligned}
$$

where $S=\sum_{k=1}^{m-1} \rho^{k}$ and $g\left(N, T_{c p}, T_{c m}, \beta, \theta_{c p}, \theta_{c m}\right)$ contains all the terms in $\mathrm{EQ}$ that do not depend on $L$. Since all the parameters $N, T_{c p}, T_{c m}, \beta, \theta_{c p}$, and $\theta_{c m}$ are constants, $g\left(N, T_{c p}, T_{c m}, \beta, \theta_{c p}, \theta_{c m}\right)$ is a constant. As a result, we only have to prove that $(1+1 / \beta)^{N-1}(N-1-\beta) / \beta+1>0$ because $T_{c m} S$ is positive and $L$ can be very large. The same proof has been given in Theorem 3. Thus, the theorem follows.

### 4.4 Maximum Number of Installments or Processors

By setting $\frac{d}{d N} T_{N}^{Q, L}=0$, we are able to obtain the maximum number of processors $N_{\max }^{Q}$ such that, if the number of processors exceeds $N_{\max }^{Q}$, then the speed-up of algorithm $Q$ decreases. The detailed analysis to obtain $N_{\max }^{Q}$ is given in Appendix A. Since the number of processors is an integer that must be larger than or equal to one, extra computations are needed to determine whether $\left\lfloor N_{\max }^{Q}\right\rfloor$ or $\left\lceil N_{\max }^{Q}\right\rceil$ produces the best speed-up.

Subsequently, we only analyze algorithm $M S$ since algorithms $M$ and $S$ are its special cases. By setting $\frac{d}{d m} T_{N, m}^{M S, L}=0$, we are able to obtain the number of installments $m_{\max }$ such that, if the number of installments exceeds $m_{\max }$, then the speed-up decreases. We divide algorithm $M S$ into two cases: $\rho=1$ and $\rho \neq 1$. We present the result for the case of $\rho=1$ here, whereas the result for the case of $\rho \neq 1$ is given in Appendix B.

When $\rho=1$, we have the following parallel execution time:

$$
\begin{aligned}
T_{N, m}^{M S, L}= & \frac{T_{c m}\left(L-\frac{N m(m+1) \Delta}{2}\right)\left(\left(\frac{1}{\beta}+1\right)^{N-1}+m-1\right)}{\left(\frac{1}{\beta}+1\right)^{N}-1+\frac{N(m-1)}{\beta}} \\
& +\frac{T_{c p} m(m+1) \Delta}{2}+m(N-1) \theta_{c p}+\theta_{c p} .
\end{aligned}
$$

Up to the additive constant $\theta_{c p}$, which does not depend on $m$, $T_{N, m}^{M S, L}$ can be written as $\frac{A m^{2}+B m+C}{\frac{N}{\beta} m+\left(\frac{1}{\beta}+1\right)^{N}-1-\frac{N}{\beta}}$, where

$$
C=T_{c m} L\left(\left(\frac{1}{\beta}+1\right)^{N-1}-1\right)
$$

$A=\left(\frac{-T_{c m} N+T_{c p}+T_{c m}}{2} \Delta\right)\left(\frac{1}{\beta}+1\right)^{N-1}-\frac{T_{c p} \Delta}{2}+\frac{N(N-1) \theta_{c p}}{\beta}$, and

$$
\begin{aligned}
B= & T_{c m} L+\left(\frac{-T_{c m} N+T_{c p}+T_{c m}}{2} \Delta+(N-1) \theta_{c p}\left(\frac{1}{\beta}+1\right)\right) \\
& \left(\frac{1}{\beta}+1\right)^{N-1}-\frac{T_{c p} \Delta}{2}-(N-1) \theta_{c p}-\frac{N(N-1) \theta_{c p}}{\beta} .
\end{aligned}
$$

To obtain $m_{\max }$, we set $\frac{d T_{N, m}^{M S, L}}{d m}=0$, which leads to

$$
\begin{aligned}
\frac{A N}{\beta} m^{2} & +2 A m\left(\left(\frac{1}{\beta}+1\right)^{N}-1-\frac{N}{\beta}\right) \\
& +B\left(\left(\frac{1}{\beta}+1\right)^{N}-1-\frac{N}{\beta}\right)-C \frac{N}{\beta}=0
\end{aligned}
$$

As a result, obtaining $m_{\max }$ for a minimum $T_{N, m}^{M S, L}$ becomes solving the above quadratic equation in one variable. Since the number of installments is an integer and must be larger than or equal to one, extra computations must be performed to determine if $\left\lfloor m_{\max }\right\rfloor$ or $\left\lceil m_{\max }\right\rceil$ produces the best speed-up.
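
The procedure for $\rho=1$ can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, $\Delta$ is passed in directly, and the coefficient `A` is written in the same form as the $\Delta$ coefficient inside $B$, which is this sketch's own expansion of the numerator. The sketch solves the quadratic for the stationary point and then compares the two neighboring integers by evaluating the execution time directly.

```python
import math


def t_ms_rho1(L, N, m, T_cm, T_cp, delta, theta_cp):
    """Parallel execution time of algorithm MS for rho = 1 (beta = N - 1)."""
    beta = T_cp / T_cm
    u = 1.0 + 1.0 / beta
    tri = m * (m + 1) / 2.0  # the double sum equals m(m+1)/2 when rho = 1
    pipelined = (T_cm * (L - N * tri * delta) * (u ** (N - 1) + m - 1)
                 / (u ** N - 1.0 + N * (m - 1) / beta))
    return pipelined + T_cp * tri * delta + m * (N - 1) * theta_cp + theta_cp


def stationary_m(L, N, T_cm, T_cp, delta, theta_cp):
    """Larger root of the quadratic in m obtained from dT/dm = 0."""
    beta = T_cp / T_cm
    u = 1.0 + 1.0 / beta
    half = (-T_cm * N + T_cp + T_cm) / 2.0 * delta
    A = half * u ** (N - 1) - T_cp * delta / 2.0 + N * (N - 1) * theta_cp / beta
    B = (T_cm * L + (half + (N - 1) * theta_cp * u) * u ** (N - 1)
         - T_cp * delta / 2.0 - (N - 1) * theta_cp
         - N * (N - 1) * theta_cp / beta)
    C = T_cm * L * (u ** (N - 1) - 1.0)
    P = N / beta                     # coefficient of m in the denominator
    Q = u ** N - 1.0 - N / beta      # constant term of the denominator
    a, b, c = A * P, 2.0 * A * Q, B * Q - C * P
    disc = b * b - 4.0 * a * c
    if a <= 0.0 or disc < 0.0:
        raise ValueError("this sketch only handles A > 0 with a real root")
    return (-b + math.sqrt(disc)) / (2.0 * a)


def best_installments(L, N, T_cm, T_cp, delta, theta_cp):
    """Pick floor or ceiling of the stationary point, whichever is faster."""
    m_star = stationary_m(L, N, T_cm, T_cp, delta, theta_cp)
    candidates = {max(1, math.floor(m_star)), max(1, math.ceil(m_star))}
    return min(candidates,
               key=lambda m: t_ms_rho1(L, N, m, T_cm, T_cp, delta, theta_cp))


print(best_installments(L=100, N=2, T_cm=1.0, T_cp=1.0, delta=1.0, theta_cp=2.0))
```

Evaluating the time at both integer neighbors is the "extra computation" mentioned above; it avoids relying on the sign of the second derivative.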

## 5 Extension to k-Dimensional Meshes

To extend the results for linear arrays to $k$-dimensional meshes, a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ of size $N=N_{1} \times N_{2} \times \ldots \times N_{k}$ is treated as a linear array $\mathrm{Mesh}_{1}$ of $N_{k}$ nodes, where each node $P_{j}$, $1 \leq j \leq N_{k}$, is a $(k-1)$-dimensional submesh $\mathrm{Mesh}_{k-1}$ of size $N_{1} \times N_{2} \times \ldots \times N_{k-1}$. Notice that $P_{j}$ is not a single processor when $k>1$. We denote the algorithm $A$ operated on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ by $A_{k}$. In $A_{k}$, the computation time taken by a node to process $L$ units of load is the time (that is, $T_{N_{1} \times \cdots \times N_{k-1}}^{A, L}$) taken for processing load $L$ in a $(k-1)$-dimensional submesh $\mathrm{Mesh}_{k-1}$ using algorithm $A$, which may not be the same as $L \times T_{N_{1} \times \cdots \times N_{k-1}}^{A, 1}$, where $T_{N_{1} \times \cdots \times N_{k-1}}^{A, 1}$ is the time taken for processing a unit of load in a $(k-1)$-dimensional submesh. In other words, $T_{N_{1} \times \cdots \times N_{k-1}}^{A, L}$ may not be linearly proportional to $L$. Depending on whether $T_{N_{1} \times \cdots \times N_{k-1}}^{A, L}$ is linearly proportional to $L$, we divide the algorithms developed in this paper into two categories: those that ignore the computation and communication start-up costs and those that include them.

### 5.1 Algorithms without Start-Up Costs

Both $M_{k}$ and $Q_{k}$ belong to this category. We only consider algorithm $M_{k}$ because $Q_{k}$ has been analyzed in [21]. The analysis in Section 4.1 shows that the parallel execution time of a linear array (that is, a one-dimensional mesh) using algorithm $M$ is linearly proportional to $L$. Thus, the time denoted by $T_{c p, 2}$ to compute a unit of load on a linear array of $N_{1}$ processors is simply $T_{N_{1}}^{M, 1}$. In general, the time to compute a unit of load on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ (denoted by $T_{c p, k+1}$) is simply $T_{N_{1} \times \cdots \times N_{k}}^{M, 1}$. Therefore, the parallel execution time and the speed-up of algorithm $M$ running on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ of size $N=N_{1} \times N_{2} \times \ldots \times N_{k}$ to process $L$ units of load can be expressed recursively as follows: $T_{N_{1} \times \cdots \times N_{k}}^{M, L}=L T_{c p, k+1}$ and $S_{N_{1} \times \cdots \times N_{k}}^{M, L}=\frac{L T_{c p}}{T_{N_{1} \times \cdots \times N_{k}}^{M, L}}=\frac{T_{c p}}{T_{c p, k+1}}$, where $\beta_{k}=\frac{T_{c p, k}}{T_{c m}}$, $\rho_{k}=\frac{\beta_{k}}{N_{k}-1}$, and

$$
T_{c p, k+1}=\frac{T_{c m}\left(\left(1+\frac{1}{\beta_{k}}\right)^{N_{k}-1}+\sum_{i=1}^{m_{k}-1}\left(\rho_{k}\right)^{i}\right)}{\left(1+\frac{1}{\beta_{k}}\right)^{N_{k}}+\frac{N_{k}}{\beta_{k}} \sum_{i=1}^{m_{k}-1}\left(\rho_{k}\right)^{i}-1}
$$

for $k \geq 1$, and $m_{k}$ is the number of installment intervals in the $k$th dimension. The base conditions are $T_{c p, 1}=T_{c p}$ and $\theta_{c p, 1}=\theta_{c p}$ and, thus, $\beta_{1}=\beta$.
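
The recursion can be traced numerically with a short Python sketch. The helper names `next_beta_m` and `mesh_speedup_m` are hypothetical, and the treatment of a dimension with $N_{k}=1$ (the ratio is passed through unchanged, since $\rho_{k}$ is undefined there) is an assumption of the sketch. It iterates $\beta_{k+1}=T_{c p, k+1} / T_{c m}$ over the dimensions and returns the speed-up $T_{c p} / T_{c p, k+1}=\beta / \beta_{k+1}$.

```python
def next_beta_m(beta_k, n_k, m_k):
    """One step of the recursion: beta_{k+1} from beta_k, N_k, and m_k."""
    if n_k == 1:
        return beta_k  # assumption: a single node adds no parallelism
    rho_k = beta_k / (n_k - 1)
    u = 1.0 + 1.0 / beta_k
    geo = sum(rho_k ** i for i in range(1, m_k))  # sum_{i=1}^{m_k-1} rho_k^i
    return (u ** (n_k - 1) + geo) / (u ** n_k + n_k / beta_k * geo - 1.0)


def mesh_speedup_m(beta, dims, installments):
    """Speed-up of algorithm M on a mesh of size dims[0] x dims[1] x ..."""
    beta_k = beta
    for n_k, m_k in zip(dims, installments):
        beta_k = next_beta_m(beta_k, n_k, m_k)
    return beta / beta_k


# Two large dimensions with beta = 1: the speed-up should be close to k*beta + 1 = 3.
print(mesh_speedup_m(1.0, [200, 200], [1, 1]))
```

For very large $N_{k}$ and small $\beta_{k}$ the power $(1+1/\beta_{k})^{N_{k}}$ can overflow a double; a production version would work with logarithms, which this sketch does not attempt.
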

Lemma 1. The asymptotic $\beta_{k+1}$ value of algorithm $M$ running on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ of size $N=N_{1} \times N_{2} \times \ldots \times N_{k}$ is 1) $\frac{\beta}{k \beta+1}$ if $N_{i}=\infty$ for all $i=1$ to $k$, 2) $\frac{\beta}{k \beta+1}$ if $m_{i}=\infty$ and $0<\rho_{i}<1$ for all $i=1$ to $k$, or 3) $\frac{\beta}{N_{1} \times \cdots \times N_{k}}$ if $m_{i}=\infty$ and $\rho_{i} \geq 1$ for all $i=1$ to $k$.

Proof. The lemma is a direct extension of Theorem 2. We prove the lemma by induction. We prove case 1 only since cases 2 and 3 can be proved by the same induction analysis. The base condition is $\beta_{1}=\beta$. Assume that $\beta_{k}=\frac{\beta}{(k-1) \beta+1}$ and $N_{i}=\infty$ for all $i=1$ to $k-1$. Since $N_{k}=\infty$, $\left(1+1 / \beta_{k}\right)^{N_{k}}$ and $\left(1+1 / \beta_{k}\right)^{N_{k}-1}$ are much larger than $\frac{N_{k}}{\beta_{k}} \sum_{i=1}^{m_{k}-1} \rho_{k}^{i}$ and $\sum_{i=1}^{m_{k}-1} \rho_{k}^{i}$, respectively. Thus, $\beta_{k+1}=\frac{\beta_{k}}{1+\beta_{k}}=\frac{\beta}{k \beta+1}$.

Theorem 9. The asymptotic speed-up of algorithm $M$ running on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ of size $N=N_{1} \times N_{2} \times \ldots \times N_{k}$ is 1) $S_{N_{1} \times \cdots \times N_{k}}^{M}=k \beta+1$ if $N_{i}=\infty$ for $i=1$ to $k$, 2) $S_{N_{1} \times \cdots \times N_{k}}^{M}=k \beta+1$ if $m_{i}=\infty$ and $0<\rho_{i} \leq 1$ for $i=1$ to $k$, or 3) $S_{N_{1} \times \cdots \times N_{k}}^{M}=N_{1} \times \cdots \times N_{k}$ if $m_{i}=\infty$ and $\rho_{i}>1$ for $i=1$ to $k$.

Proof. The proof can be shown directly from Lemma 1.

Theorem 10. For algorithms $Q$ and $M$ without start-up costs, we have $T_{N_{1} \times \cdots \times N_{k}}^{Q, L} \geq T_{N_{1} \times \cdots \times N_{k}}^{M, L}$ or $S_{N_{1} \times \cdots \times N_{k}}^{Q, L} \leq S_{N_{1} \times \cdots \times N_{k}}^{M, L}$ if $N_{i} \geq 1$ for $i=1$ to $k$.

Proof. We can prove the theorem by induction and by using the same analysis as in Theorem 3.

### 5.2 Algorithms with Start-Up Costs

Consider algorithms $Q$, $S$, or $M S$ running on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ that is treated as a linear array $\mathrm{Mesh}_{1}$ of $N_{k}$ nodes. If the computation and communication start-up costs are considered, the parallel execution time of processing $L$ units of load in each node is not linearly proportional to $L$. In other words, the computation-to-communication ratio of each node in the linear array $\mathrm{Mesh}_{1}$ is not $T_{N_{1} \times \cdots \times N_{k-1}}^{A, 1}$, where $A$ is $Q$, $S$, or $M S$.

Based on the results shown in Table 2, the parallel execution time of a linear array of $N_{1}$ processors is

$$
T_{N_{1}}^{S, L}=\frac{T_{c m}\left(L-N_{1} \Delta\right)(1+1 / \beta)^{N_{1}-1}}{(1+1 / \beta)^{N_{1}}-1}+\Delta T_{c p}+N_{1} \theta_{c p}
$$

for algorithm $S$. By a simple term rearrangement, we have $T_{N_{1}}^{S, L}=L T_{c p, 2}+\theta_{c p, 2}$, where

$$
\theta_{c p, 2}=\Delta T_{c p}+N_{1} \theta_{c p}-\frac{N_{1} \Delta T_{c m}(1+1 / \beta)^{N_{1}-1}}{(1+1 / \beta)^{N_{1}}-1}
$$

and

$$
T_{c p, 2}=\frac{T_{c m}(1+1 / \beta)^{N_{1}-1}}{(1+1 / \beta)^{N_{1}}-1} .
$$

$T_{N_{1}}^{S, L}=L T_{c p, 2}+\theta_{c p, 2}$ can be interpreted as "the time to compute a unit of load and the computation start-up cost on a linear array of $N_{1}$ processors are $T_{c p, 2}$ and $\theta_{c p, 2}$, respectively." Consequently, the parallel execution time of algorithm $S$ running on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ of size $N=N_{1} \times N_{2} \times \ldots \times N_{k}$ to process $L$ units of load can be expressed recursively as follows: $T_{N_{1} \times \cdots \times N_{k}}^{S, L}=L T_{c p, k+1}+\theta_{c p, k+1}$, where $\beta_{k}=\frac{T_{c p, k}}{T_{c m}}$,

$$
T_{c p, k+1}=\frac{T_{c m}\left(1+1 / \beta_{k}\right)^{N_{k}-1}}{\left(1+1 / \beta_{k}\right)^{N_{k}}-1}, \quad \Delta_{k}=\frac{\left(\theta_{c p, k}-\theta_{c m}\right)}{T_{c m}}
$$

and

$$
\theta_{c p, k+1}=N_{k} \theta_{c p, k}+\Delta_{k} T_{c p, k}-\frac{N_{k} \Delta_{k} T_{c m}\left(1+1 / \beta_{k}\right)^{N_{k}-1}}{\left(1+1 / \beta_{k}\right)^{N_{k}}-1} .
$$

The base conditions are $T_{c p, 1}=T_{c p}$ and $\theta_{c p, 1}=\theta_{c p}$. Therefore, the speed-up of algorithm $S$ running on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ of size $N=N_{1} \times N_{2} \times \ldots \times N_{k}$ to process $L$ units of load is

$$
S_{N_{1} \times \cdots \times N_{k}}^{S, L}=\frac{L T_{c p}+\theta_{c p}}{T_{N_{1} \times \cdots \times N_{k}}^{S, L}}
$$
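
The recursion for algorithm $S$ can be followed with a minimal Python sketch. The function names `mesh_time_s` and `mesh_speedup_s` are hypothetical and the code is an illustration of the formulas above, not the authors' program. It carries the pair $\left(T_{c p, k}, \theta_{c p, k}\right)$ through the dimensions and then evaluates $L T_{c p, k+1}+\theta_{c p, k+1}$.

```python
def mesh_time_s(L, dims, T_cm, T_cp, theta_cp, theta_cm):
    """Parallel execution time of algorithm S on a mesh dims[0] x dims[1] x ..."""
    t_cp_k, theta_k = T_cp, theta_cp          # base conditions (k = 1)
    for n_k in dims:
        beta_k = t_cp_k / T_cm
        u = 1.0 + 1.0 / beta_k
        delta_k = (theta_k - theta_cm) / T_cm
        t_next = T_cm * u ** (n_k - 1) / (u ** n_k - 1.0)
        # N_k * theta_k + Delta_k * T_cp,k - N_k * Delta_k * T_cp,k+1
        theta_next = n_k * theta_k + delta_k * t_cp_k - n_k * delta_k * t_next
        t_cp_k, theta_k = t_next, theta_next
    return L * t_cp_k + theta_k


def mesh_speedup_s(L, dims, T_cm, T_cp, theta_cp, theta_cm):
    """Speed-up relative to one processor, (L*T_cp + theta_cp) / T."""
    return (L * T_cp + theta_cp) / mesh_time_s(L, dims, T_cm, T_cp,
                                               theta_cp, theta_cm)


# Linear array with the parameters used for Fig. 8a.
print(mesh_speedup_s(10000, [459], 1.0, 100.0, 2.0, 1.0))
```

With a single dimension the recursion collapses to the closed form for $T_{N_{1}}^{S, L}$, and for $N_{1}=459$ with the parameters of Fig. 8a the sketch should give a speed-up of about 94.66, the value quoted in Section 6.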

## TABLE 3

The Computation-to-Communication Ratio $\left(\beta_{k+1}^{A}\right)$ and Computation Start-Up Cost $\left(\theta_{k+1}^{A}\right)$ of a $k$-Dimensional Mesh of Size $N_{1} \times N_{2} \times \ldots \times N_{k}$ Processors, where $A$ is $Q, M$, $S$, or $M S, \rho_{k}=\frac{\beta_{k}}{N_{k}-1}, \Delta_{k}^{A}=\frac{\theta_{k}^{A}-\theta_{c m}}{T_{c m}}$, and the Base Conditions Are $\beta_{1}^{A}=\beta$ and $\theta_{1}^{A}=\theta_{c p}$

| $A$ | $\beta_{k+1}^{A}$ | $\theta_{k+1}^{A}$ |
| :---: | :---: | :---: |
| $Q$ | $\beta_{k+1}^{Q}=\frac{\left(1+1 / \beta_{k}^{Q}\right)^{N_{k}-1}}{\left(1+1 / \beta_{k}^{Q}\right)^{N_{k}}-1}$ | $\theta_{k+1}^{Q}=\left(N_{k}-1\right) \max \left\{\theta_{k}^{Q}, \theta_{c m}\right\}+\theta_{k}^{Q}$ |
| $M$ | $\beta_{k+1}^{M}=\frac{\left(1+1 / \beta_{k}^{M}\right)^{N_{k}-1}+\sum_{i=1}^{m_{k}-1}\left(\rho_{k}^{M}\right)^{i}}{\left(1+1 / \beta_{k}^{M}\right)^{N_{k}}+\frac{N_{k}}{\beta_{k}^{M}} \sum_{i=1}^{m_{k}-1}\left(\rho_{k}^{M}\right)^{i}-1}$ | $\theta_{k+1}^{M}=0$ |
| $S$ | $\beta_{k+1}^{S}=\frac{\left(1+1 / \beta_{k}^{S}\right)^{N_{k}-1}}{\left(1+1 / \beta_{k}^{S}\right)^{N_{k}}-1}$ | $\theta_{k+1}^{S}=N_{k} \theta_{k}^{S}+\Delta_{k}^{S} T_{c m}\left(\beta_{k}^{S}-N_{k} \beta_{k+1}^{S}\right)$ |
| $M S$ | $\beta_{k+1}^{M S}=\frac{\left(1+1 / \beta_{k}^{M S}\right)^{N_{k}-1}+\sum_{i=1}^{m_{k}-1}\left(\rho_{k}^{M S}\right)^{i}}{\left(1+1 / \beta_{k}^{M S}\right)^{N_{k}}+\frac{N_{k}}{\beta_{k}^{M S}} \sum_{i=1}^{m_{k}-1}\left(\rho_{k}^{M S}\right)^{i}-1}$ | $\theta_{k+1}^{M S}=\Delta_{k}^{M S}\left(\sum_{l=0}^{m_{k}-1} \sum_{i=0}^{l}\left(\rho_{k}^{M S}\right)^{i}\right) T_{c m}\left(\beta_{k}^{M S}-N_{k} \beta_{k+1}^{M S}\right)+\left(\left(N_{k}-1\right) m_{k}+1\right) \theta_{k}^{M S}$ |

Similarly, we can derive $T_{c p, k+1}$ and $\theta_{c p, k+1}$ for algorithms $Q$ and $M S$. We summarize the results in Table 3. Therefore, the parallel execution time and the speed-up of algorithms $Q$, $S$, and $M S$ running on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ of size $N=N_{1} \times N_{2} \times \ldots \times N_{k}$ to process $L$ units of load can be expressed recursively based on the results shown in Tables 2 and 3. Notice that, for algorithms $M$ and $M S$, $\rho_{k}=\frac{\beta_{k}}{N_{k}-1}$.

Lemma 2. The asymptotic $\beta_{k+1}$ value of algorithm $Q$ or $S$ running on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ of size $N=N_{1} \times N_{2} \times \ldots \times N_{k}$ is $\beta_{k+1}=\frac{\beta}{k \beta+1}$.

Proof. The proof is the same as that in Lemma 1.

Theorem 11. The asymptotic speed-up of algorithm $A$, where $A$ is $Q$ or $S$, running on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ of size $N=N_{1} \times \ldots \times N_{k}$ is $S_{N_{1} \times \cdots \times N_{k}}^{A}=k \beta+1$ if $N_{i}$ is very large for all $i=1$ to $k$ and $L \gg N_{k}$.

Proof. The proof can be shown directly from Lemma 2.

Lemma 3. The asymptotic $\beta_{k+1}$ value of algorithm $M S$ running on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ of size $N=N_{1} \times N_{2} \times \ldots \times N_{k}$ is 1) $\frac{\beta}{k \beta+1}$ if $N_{i}=\infty$ for all $i=1$ to $k$, 2) $\frac{\beta}{k \beta+1}$ if $m_{i}=\infty$ and $0<\rho_{i}<1$ for all $i=1$ to $k$, or 3) $\frac{\beta}{N_{1} \times \cdots \times N_{k}}$ if $m_{i}=\infty$ and $\rho_{i} \geq 1$ for all $i=1$ to $k$.

Proof. The proof is the same as that in Lemma 1.

Theorem 12. The asymptotic speed-up of algorithm $M S$ running on a $k$-dimensional mesh $\mathrm{Mesh}_{k}$ of size $N=N_{1} \times N_{2} \times \ldots \times N_{k}$ is 1) $S_{N_{1} \times \cdots \times N_{k}}^{M S}=k \beta+1$ if $N_{i}=\infty$ for all $i=1$ to $k$, 2) $S_{N_{1} \times \cdots \times N_{k}}^{M S}=k \beta+1$ if $m_{i}=\infty$ and $0<\rho_{i} \leq 1$ for all $i=1$ to $k$, or 3) $S_{N_{1} \times \cdots \times N_{k}}^{M S}=N_{1} \times \cdots \times N_{k}$ if $m_{i}=\infty$ and $\rho_{i}>1$ for all $i=1$ to $k$.

Proof. The proof can be shown directly from Lemma 3.

Theorem 13. $S_{N_{1} \times \cdots \times N_{k}}^{S, L} \geq S_{N_{1} \times \cdots \times N_{k}}^{Q, L}$ and $S_{N_{1} \times \cdots \times N_{k}}^{S, L} \cong S_{N_{1} \times \cdots \times N_{k}}^{Q, L}$ if $L \gg N_{i}$ for all $i=1$ to $k$.

Proof. The proof is the same as that in Theorem 5.

Theorem 14. $S_{N_{1} \times \cdots \times N_{k}}^{S, L} \geq S_{N_{1} \times \cdots \times N_{k}}^{M S, L}$ when $L$ is a constant and $N_{i}$ tends to infinity for all $i=1$ to $k$.

Proof. The proof is the same as that in Theorem 7.

Theorem 15. $S_{N_{1} \times \cdots \times N_{k}}^{M S, L} \geq S_{N_{1} \times \cdots \times N_{k}}^{S, L}$ when $N_{i}$ is a constant for all $i=1$ to $k$ and $L$ tends to infinity.

Proof. The proof is the same as that in Theorem 8.

Fig. 7. Speed-up of algorithms $Q$ and $M$, where $L=10,000$, $T_{c m}=1$, $T_{c p}=\beta=100$, and $m=5$.

## 6 Discussions of the Results

In Fig. 7, we illustrate the numerical results for the speed-ups of algorithms $Q$ and $M$ without start-up costs (that is, $\theta_{c p}=\theta_{c m}=0$). We assume that $\beta=100$ and the number of installments for algorithm $M$ is $m=5$. Both algorithms $Q$ and $M$ approach their maximal speed-up of $\beta+1$ when $N$ is very large. However, the speed-up of algorithm $M$ approaches $\beta+1$ more quickly than that of algorithm $Q$. For example, when the number of processors in the linear array is $N=200$, the speed-up of algorithm $M$ reaches 100.2, which is very close to the asymptotic speed-up, whereas the speed-up of algorithm $Q$ only reaches 87.2. When $N=500$, algorithm $M$ reaches the maximal speed-up 101, whereas the speed-up of algorithm $Q$ is 100.3. We also calculate the results for $T_{c p}=T_{c m}$, that is, $\beta=1$. In this case, both algorithms $Q$ and $M$ reach the maximal speed-up $(1+\beta=2)$ very quickly because $(1+1 / \beta)^{N}$ and $(1+1 / \beta)^{N-1}$ dominate the other terms in the speed-up equations.
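
The figures quoted above can be recomputed from the closed forms of Section 5.1 with $k=1$. The following Python sketch is only an illustration: the function name `speedup_m` is hypothetical, and algorithm $Q$ is evaluated as the special case $m=1$, for which the installment sum is empty and the expression reduces to the $Q$ entry of Table 3.

```python
def speedup_m(beta, N, m):
    """Speed-up T_cp / T_cp,2 of algorithm M on a linear array, no start-up costs.

    With m = 1 the installment sum is empty, which gives algorithm Q.
    Assumes N >= 2 so that rho = beta / (N - 1) is defined.
    """
    rho = beta / (N - 1)
    u = 1.0 + 1.0 / beta
    geo = sum(rho ** i for i in range(1, m))  # sum_{i=1}^{m-1} rho^i
    return beta * (u ** N + N / beta * geo - 1.0) / (u ** (N - 1) + geo)


for N in (200, 500):
    print(N, round(speedup_m(100.0, N, 1), 1), round(speedup_m(100.0, N, 5), 1))
```

For $\beta=100$ the sketch should give roughly 87.2 ($Q$) and 100.2 ($M$) at $N=200$, and roughly 100.3 and 101.0 at $N=500$, in line with the values read from Fig. 7.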

Next, we show the numerical results for the speed-ups of algorithms $Q$, $S$, and $M S$ with two sets of start-up costs. Fig. 8a shows the results for $\theta_{c m}=1$ and $\theta_{c p}=2$. Algorithm $S$ performs consistently better than algorithm $Q$. With $N_{\max }^{Q}=395.65$ from Section 4.4 and speed-up calculations for $N=395$ and $N=396$, we find that the best speed-up of algorithm $Q$ is 91.84 when $N=396$. The speed-up of algorithm $S$ reaches 91.84 when $N=285$. The best speed-up of algorithm $S$ is 94.66 when $N=459$. When $N$ is large, algorithm $M S$ does not perform better than algorithm $Q$ or $S$, which is in agreement with Theorem 7. However, algorithm $M S$ performs better than algorithm $Q$ or $S$ when $N$ is small. The best speed-up of algorithm $M S$ is 89.82 when $N=177$. Fig. 8b shows the results for large start-up costs, $\theta_{c m}=100$ and $\theta_{c p}=100$. Fig. 8b has a similar performance trend to Fig. 8a in that, when $N$ is small ($N<100$), algorithm $M S$ performs the best, and when $N$ is large ($N>100$), algorithm $M S$ performs the worst. Also, the speed-ups of algorithms $Q$ and $S$ are almost the same because the start-up costs are large.

Fig. 8. Speed-up of algorithms $Q$, $S$, and $M S$, where $L=10,000$, $T_{c m}=1$, $T_{c p}=\beta=100$, and $m=5$. (a) $\theta_{c p}=2$ and $\theta_{c m}=1$. (b) $\theta_{c p}=100$ and $\theta_{c m}=100$.

Fig. 9. Speed-ups of algorithms $Q$, $S$, and $M S$, where $L=200,000$, $T_{c m}=1$, $T_{c p}=\beta=100$, $\theta_{c p}=2$, and $\theta_{c m}=1$. (a) Two-dimensional meshes. (b) Three-dimensional meshes.

Finally, we illustrate the numerical results for the speed-ups of algorithms $Q$, $S$, and $M S$ on 2D and 3D meshes. We set $L=200,000$, $T_{c m}=1$, and $T_{c p}=\beta=100$. In Fig. 9a, $m_{1}$ is set to 2 and $m_{2}$ is set to the optimal numbers of installments (that is, $m_{\max }$), which are 8, 25, 11, 7, 5, 2, 2, 2, 2, and 2 for algorithm $M S$ on 2D meshes of $5 \times 10$, $5 \times 20$, $5 \times 30$, $5 \times 40$, $5 \times 50$, $5 \times 60$, $5 \times 70$, $5 \times 80$, $5 \times 90$, and $5 \times 100$ processors, respectively. When the mesh is smaller than or equal to $5 \times 50$ processors, the speed-ups of algorithms $Q$ and $S$ are almost the same, and algorithm $M S$ performs the best. As in the linear array, when the number of processors in the mesh increases, algorithm $M S$ does not perform better than algorithm $Q$ or $S$. Algorithm $S$ always performs better than algorithm $Q$. The speed-ups for 3D meshes of various sizes with $m_{1}=m_{2}=m_{3}=2$ are shown in Fig. 9b. As in 2D meshes, algorithm $M S$ performs best when the mesh is small and algorithm $S$ performs best when the mesh is large.

## 7 Conclusion

We have proposed a set of algorithms, $M$, $S$, and $M S$, which employ the multi-installment processing technique and account for the computation and communication start-up costs to distribute divisible load on linear arrays. The extension to multidimensional meshes is also presented. We derived the closed-form solutions for the parallel execution times and speed-ups of the proposed algorithms. When the computation and communication start-up costs are not considered, the proposed algorithm $M$ performs better than algorithm $Q$ [21] in all cases. When the computation and communication start-up costs are considered, the proposed algorithm $S$ performs better than algorithm $Q$ [21] in all cases and the proposed algorithm $M S$ performs better than algorithms $S$ and $Q$ when the number of processors is small or when the total load is very large.

## Appendix A

The maximum number of processors $N_{\max }^{Q}$ such that, if the number of processors exceeds $N_{\max }^{Q}$, then the speed-up of algorithm $Q$ decreases, is obtained by differentiating $T_{N}^{Q, L}$ with respect to $N$ as

$$
\begin{aligned}
\frac{d}{d N} T_{N}^{Q, L}= & \frac{-L T_{c m}(1+1 / \beta)^{N_{\max }^{Q}-1} \ln (1+1 / \beta)}{\left((1+1 / \beta)^{N_{\max }^{Q}}-1\right)^{2}} \\
& +\max \left\{\theta_{c p}, \theta_{c m}\right\}=0
\end{aligned}
$$

This equation can be simplified by setting a new constant $u=(1+1 / \beta)$ and a new variable $x=u^{N_{\max }^{Q}}$ as follows: $x^{2}-\left(\frac{L T_{c m} \ln u}{u \max \left\{\theta_{c p}, \theta_{c m}\right\}}+2\right) x+1=0$. As a result, $N_{\max }^{Q}$ can be obtained by solving the quadratic equation for $x$ and taking $N_{\max }^{Q}=\ln x / \ln u$ for the root with $x \geq 1$ (the two roots are reciprocals of each other).
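
As a numerical illustration, the quadratic can be solved in a few lines of Python. The function names are hypothetical, and the sketch assumes $\max \left\{\theta_{c p}, \theta_{c m}\right\}>0$ and uses the root with $x \geq 1$ so that the resulting number of processors is nonnegative.

```python
import math


def n_max_q(L, T_cm, beta, theta_cp, theta_cm):
    """Real-valued N at which dT/dN = 0 for algorithm Q."""
    theta = max(theta_cp, theta_cm)
    if theta <= 0:
        raise ValueError("the sketch assumes a positive start-up cost")
    u = 1.0 + 1.0 / beta
    b = L * T_cm * math.log(u) / (u * theta) + 2.0   # x^2 - b*x + 1 = 0
    x = (b + math.sqrt(b * b - 4.0)) / 2.0           # root with x >= 1
    return math.log(x) / math.log(u)


def dT_dN_q(N, L, T_cm, beta, theta_cp, theta_cm):
    """Derivative of the execution time of algorithm Q with respect to N."""
    u = 1.0 + 1.0 / beta
    return (-L * T_cm * u ** (N - 1) * math.log(u) / (u ** N - 1.0) ** 2
            + max(theta_cp, theta_cm))


n_star = n_max_q(10000, 1.0, 100.0, 2.0, 1.0)
print(n_star, dT_dN_q(n_star, 10000, 1.0, 100.0, 2.0, 1.0))
```

The integer optimum is then found by comparing the speed-ups at the floor and the ceiling of the returned value, as described in Section 4.4.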

## Appendix B

For the case of $\rho \neq 1$ in algorithm $M S$, the number of installments $m_{\max }$ such that, if the number of installments exceeds $m_{\text {max }}$, then the speed-up decreases, is obtained by differentiating $T_{N, m}^{M S, L}$ with respect to $m$ as

$$
\frac{d T_{N, m}^{M S, L}}{d m}=T_{c p} \frac{d L_{N}^{M S}}{d m}+(N-1) \theta_{c p}=0
$$

Through a simple but tedious derivation, we obtain $\frac{d T_{N, m}^{M S, L}}{d m}=0=D \rho^{2 m}+E m \rho^{m}+F \rho^{m}+G$, where the constants $D$, $E$, $F$, and $G$ are given below. This nonlinear equation in $m$ can be solved numerically by Newton's method to obtain $m_{\max }$.

From Section 4.3, we have

$$
\begin{aligned}
L_{N}^{M S}= & \frac{\frac{1}{\beta}\left(L-\frac{N \Delta}{\rho-1}\left(\frac{\rho^{m+1}-\rho}{\rho-1}-m\right)\right)\left((1+1 / \beta)^{N-1}+\frac{\rho}{\rho-1}\left(\rho^{m-1}-1\right)\right)}{(1+1 / \beta)^{N}+\frac{N \rho}{\beta(\rho-1)}\left(\rho^{m-1}-1\right)-1} \\
& +\frac{\Delta}{\rho-1}\left(\frac{\rho^{m+1}-\rho}{\rho-1}-m\right) \\
= & \frac{\frac{1}{\beta}\left(L+\frac{N \Delta \rho}{(\rho-1)^{2}}-\frac{N \Delta}{(\rho-1)^{2}} \rho^{m+1}+\frac{N \Delta}{\rho-1} m\right)\left(\frac{1}{\rho-1} \rho^{m}+(1+1 / \beta)^{N-1}-\frac{\rho}{\rho-1}\right)}{\frac{N \rho^{m}}{\beta(\rho-1)}+(1+1 / \beta)^{N}-\frac{N \rho}{\beta(\rho-1)}-1} \\
& +\frac{\Delta}{\rho-1}\left(\frac{\rho^{m+1}}{\rho-1}-m\right)-\frac{\Delta \rho}{(\rho-1)^{2}} \\
= & \frac{\rho^{m}\left(\frac{\Delta}{\rho-1}\left(1+\frac{1}{\beta}\right)^{N-1}+\frac{L}{\beta(\rho-1)}-\frac{\Delta \rho}{(\rho-1)^{2}}\right)+m\left(\frac{\Delta}{\rho-1}-\frac{\Delta}{\rho}\left(1+\frac{1}{\beta}\right)^{N-1}\right)}{\frac{N \rho^{m}}{\beta(\rho-1)}+C_{0}} \\
& +\frac{C_{2} C_{3} / \beta-C_{0} C_{1}}{\frac{N \rho^{m}}{\beta(\rho-1)}+C_{0}} \\
= & \frac{A \rho^{m}+B m+C_{2} C_{3} / \beta-C_{0} C_{1}}{\frac{N \rho^{m}}{\beta(\rho-1)}+C_{0}},
\end{aligned}
$$

where

$$
\begin{aligned}
A & =\frac{\Delta}{\rho-1}\left(1+\frac{1}{\beta}\right)^{N-1}+\frac{L}{\beta(\rho-1)}-\frac{\Delta \rho}{(\rho-1)^{2}}, \\
B & =\frac{\Delta}{\rho-1}-\frac{\Delta}{\rho}\left(1+\frac{1}{\beta}\right)^{N-1}, \quad C_{0}=\left(1+\frac{1}{\beta}\right)^{N}-\frac{N \rho}{\beta(\rho-1)}-1, \\
C_{1} & =\frac{\Delta \rho}{(\rho-1)^{2}}, \quad C_{2}=L+\frac{N \Delta \rho}{(\rho-1)^{2}}, \text { and } \\
C_{3} & =(1+1 / \beta)^{N-1}-\frac{\rho}{\rho-1} .
\end{aligned}
$$

Because $\frac{d T_{N, m}^{M S, L}}{d m}=T_{c p} \frac{d L_{N}^{M S}}{d m}+(N-1) \theta_{c p}=0$, we have

$$
\begin{aligned}
0= & T_{c p}\left(-m \rho^{m}\left(\frac{B(\ln \rho) N}{\beta(\rho-1)}\right)+\rho^{m}\left(\frac{-(\ln \rho) N\left(C_{2} C_{3} / \beta-C_{0} C_{1}\right)}{\beta(\rho-1)}+\frac{B N}{\beta(\rho-1)}+A(\ln \rho) C_{0}\right)+B C_{0}\right) \\
& +\frac{(N-1) \theta_{c p} N^{2} \rho^{2 m}}{\beta^{2}(\rho-1)^{2}}+2 C_{0} \frac{(N-1) \theta_{c p} N \rho^{m}}{\beta(\rho-1)}+(N-1) \theta_{c p} C_{0}^{2} \\
= & \rho^{2 m} \frac{(N-1) \theta_{c p} N^{2}}{\beta^{2}(\rho-1)^{2}}-m \rho^{m}\left(\frac{B(\ln \rho) N T_{c p}}{\beta(\rho-1)}\right)+B C_{0} T_{c p}+(N-1) \theta_{c p} C_{0}^{2} \\
& +\rho^{m}\left(\frac{-T_{c p}(\ln \rho) N\left(C_{2} C_{3} / \beta-C_{0} C_{1}\right)}{\beta(\rho-1)}+\frac{T_{c p} B N}{\beta(\rho-1)}+T_{c p} A(\ln \rho) C_{0}+2 C_{0} \frac{(N-1) \theta_{c p} N}{\beta(\rho-1)}\right) \\
= & D \rho^{2 m}+E m \rho^{m}+F \rho^{m}+G,
\end{aligned}
$$

where

$$
\begin{aligned}
D= & \frac{(N-1) \theta_{c p} N^{2}}{\beta^{2}(\rho-1)^{2}}, \quad E=-\frac{B(\ln \rho) N T_{c p}}{\beta(\rho-1)}, \\
G= & B C_{0} T_{c p}+(N-1) \theta_{c p} C_{0}^{2}, \text { and } \\
F= & \frac{-T_{c p}(\ln \rho) N\left(C_{2} C_{3} / \beta-C_{0} C_{1}\right)}{\beta(\rho-1)}+\frac{T_{c p} B N}{\beta(\rho-1)} \\
& +T_{c p} A(\ln \rho) C_{0}+2 C_{0} \frac{(N-1) \theta_{c p} N}{\beta(\rho-1)} .
\end{aligned}
$$
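
Newton's method for an equation of the form $D \rho^{2 m}+E m \rho^{m}+F \rho^{m}+G=0$ can be sketched in Python as follows. The function name `newton_m_max`, the starting point `m0`, the tolerance, and the iteration cap are all assumptions of this sketch; the coefficients are taken as inputs, and convergence depends on the starting point, as with any Newton iteration.

```python
import math


def newton_m_max(D, E, F, G, rho, m0=1.0, tol=1e-12, max_iter=100):
    """Solve D*rho^(2m) + E*m*rho^m + F*rho^m + G = 0 for real m."""
    ln_rho = math.log(rho)
    m = m0
    for _ in range(max_iter):
        r = rho ** m
        f = D * r * r + E * m * r + F * r + G
        # derivative with respect to m, using d(rho^m)/dm = ln(rho) * rho^m
        df = 2.0 * D * ln_rho * r * r + E * r + E * m * ln_rho * r + F * ln_rho * r
        if df == 0.0:
            raise ZeroDivisionError("zero derivative, try another starting point")
        step = f / df
        m -= step
        if abs(step) < tol:
            return m
    raise RuntimeError("Newton iteration did not converge")


# Toy coefficients chosen so that the root is m = 3: 0.25**m = 0.5**6.
print(newton_m_max(D=1.0, E=0.0, F=0.0, G=-(0.5 ** 6), rho=0.5))
```

As in the case $\rho=1$, the real-valued root still has to be rounded down and up, and the two resulting integers compared by evaluating $T_{N, m}^{M S, L}$.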

## References

[1] D. Altilar and Y. Paker, "Optimal Scheduling Algorithms for Communication Constrained Parallel Processing," Proc. Eighth Int'l Euro-Par Conf. (Euro-Par '02), pp. 197-206, 2002.
[2] V. Bharadwaj and H.M. Wong, "Scheduling Divisible Loads on Heterogeneous Linear Daisy Chain Networks with Arbitrary Processor Release Times," IEEE Trans. Parallel and Distributed Systems, vol. 15, no. 3, pp. 273-288, Mar. 2004.
[3] V. Bharadwaj, D. Ghose, and V. Mani, "Multi-Installment Load Distribution in Tree Networks with Delays," IEEE Trans. Aerospace and Electronic Systems, vol. 31, no. 2, pp. 555-567, 1995.
[4] V. Bharadwaj, D. Ghose, V. Mani, and T.G. Robertazzi, Scheduling Divisible Loads in Parallel and Distributed Systems. IEEE CS Press, 1996.
[5] V. Bharadwaj, D. Ghose, and T. Robertazzi, "Divisible Load Theory: A New Paradigm for Load Scheduling in Distributed Systems," Cluster Computing, vol. 6, no. 1, pp. 7-17, 2003.
[6] V. Bharadwaj and H.M. Wong, "Scheduling Divisible Loads on Heterogeneous Linear Daisy Chain Networks with Arbitrary Processor Release Times," IEEE Trans. Parallel and Distributed Systems, vol. 15, no. 3, pp. 273-288, Mar. 2004.
[7] J. Blazewicz, M. Drozdowski, and M. Markiewicz, "Divisible Task Scheduling-Concept and Verification," Parallel Computing, vol. 25, pp. 87-98, 1999.
[8] J. Blazewicz, M. Drozdowski, F. Guinand, and D. Trystram, "Scheduling a Divisible Task in a 2-Dimensional Mesh," Discrete Applied Math., vol. 94, nos. 1-3, pp. 35-50, 1999.
[9] S. Chan, V. Bharadwaj, and D. Ghose, "Large Matrix-Vector Products on Distributed Bus Networks with Communication Delays Using the Divisible Load Paradigm: Performance and Simulation," Math. and Computers in Simulation, vol. 58, pp. 71-92, 2001.
[10] S. Charcranoon, T.G. Robertazzi, and S. Luryi, "Parallel Processor Configuration Design with Processing/Transmission Costs," IEEE Trans. Computers, vol. 49, no. 9, pp. 987-991, Sept. 2000.
[11] Y.C. Cheng and T.G. Robertazzi, "Distributed Computation with Communication Delays," IEEE Trans. Aerospace and Electronic Systems, vol. 24, no. 6, pp. 700-712, Nov. 1988.
[12] M. Drozdowski, Selected Problems of Scheduling Tasks in Multiprocessor Computing Systems. Poznan Univ. of Technology Press, 1997.
[13] M. Drozdowski and W. Glazek, "Scheduling Divisible Loads in a Three-Dimensional Mesh of Processors," Parallel Computing, vol. 25, no. 4, pp. 381-404, 1999.
[14] M. Drozdowski and P. Wolniewicz, "Out-of-Core Divisible Load Processing," IEEE Trans. Parallel and Distributed Systems, vol. 14, no. 10, pp. 1048-1056, Oct. 2003.
[15] M. Drozdowski and P. Wolniewicz, "Performance Limits of Divisible Load Processing in Systems with Limited Communication Buffers," J. Parallel and Distributed Computing, vol. 64, no. 8, pp. 960-973, 2004.
[16] D. Ghose and H.J. Kim, "Computing BLAS Level-2 Operations on Workstation Clusters Using the Divisible Load Paradigm," Math. and Computer Modelling, vol. 41, no. 1, pp. 49-70, Jan. 2005.
[17] W. Glazek, "Distributed Computation in a Three-Dimensional Mesh with Communication Delays," Proc. Sixth Euromicro Workshop Parallel and Distributed Processing, pp. 38-42, Jan. 1998.
[18] J. Guo, J. Yao, and L.N. Bhuyan, "An Efficient Packet Scheduling Algorithm in Network Processors," Proc. INFOCOM, Mar. 2005.
[19] C. Lee and M. Hamdi, "Parallel Image Processing Applications on a Network of Workstations," Parallel Computing, vol. 21, pp. 137-160, 1995.
[20] K. Li, "Managing Divisible Load on Partitionable Networks," High Performance Computing Systems and Applications, J. Schaeffer, ed., pp. 217-228, Kluwer Academic Publishers, 1998.
[21] K. Li, "Improved Methods for Divisible Load Distribution on k-Dimensional Meshes Using Pipelined Communications," IEEE Trans. Parallel and Distributed Systems, vol. 14, no. 12, pp. 1250-1261, Dec. 2003.
[22] K. Li, "Accelerating Divisible Load Distribution on Tree and Pyramid Networks Using Pipelined Communications," Proc. 18th Int'l Parallel and Distributed Processing Symp. (IPDPS '04), p. 228, Apr. 2004.
[23] X. Li, B. Veeravalli, and C.C. Ko, "Divisible Load Scheduling on a Hypercube Cluster with Finite-Size Buffers and Granularity Constraints," Proc. First IEEE/ACM Int'l Symp. Cluster Computing and the Grid (CCGrid '01), pp. 660-667, May 2001.
[24] X. Li, V. Bharadwaj, and C.C. Ko, "Distributed Image Processing on a Network of Workstations," Int'l J. Computers and Applications, vol. 25, no. 2, pp. 1-10, 2003.
[25] T.G. Robertazzi, "Ten Reasons to Use Divisible Load Theory," Computer, pp. 63-68, May 2003.
[26] B. Veeravalli, X. Li, and C.C. Ko, "On the Influence of Start-Up Costs in Scheduling Divisible Loads on Bus Networks," IEEE Trans. Parallel and Distributed Systems, vol. 11, no. 12, pp. 1288-1305, Dec. 2000.
[27] R. Wang, A. Krishnamurthy, R. Martin, T. Anderson, and D. Culler, "Modeling Communication Pipeline Latency," Proc. Joint Int'l Conf. Measurement and Modeling of Computer Systems (SIGMETRICS '98/PERFORMANCE '98), pp. 22-32, 1998.
[28] Y. Yang, K.V.D. Raadt, and H. Casanova, "Multiround Algorithms for Scheduling Divisible Loads," IEEE Trans. Parallel and Distributed Systems, vol. 16, no. 11, pp. 1092-1102, Nov. 2005.


Yeim-Kuan Chang received the MS degree in computer science from the University of Houston, Clear Lake, in 1990 and the PhD degree in computer science from Texas A\&M University in 1995. He is currently an assistant professor in the Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, ROC. His research interests include computer architecture, multiprocessor systems, Internet router design, and computer networking.


Jia-Hwa Wu received the BS degree in mechanical engineering from Feng Chia University, Taiwan, in 1981 and the MBA degree in industrial management from the National Cheng Kung University, Taiwan, in 1986. He is currently a doctoral candidate in the Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, ROC. His research interests include parallel compilers, data mining, and Internet computing.


Chi-Yeh Chen received the BS degree in communication engineering from Da-Yeh University, Changhua, Taiwan, ROC, in 2001 and the MS degree in computer science and information engineering from the National Cheng Kung University, Tainan, Taiwan, ROC, in 2005. He is currently an engineer in the Plasma and Space Science Center, National Cheng Kung University, Tainan, Taiwan, ROC. His research interests include parallel algorithms and load distribution.


Chih-Ping Chu received the BS degree in agricultural chemistry from the National Chung Hsing University, Taiwan, the MS degree in computer science from the University of California, Riverside, and the PhD degree in computer science from Louisiana State University. He is currently a professor in the Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, ROC. His research interests include parallelizing compilers, parallel computing, parallel processing, Internet computing, DNA computing, and software engineering.


[^0]:    - The authors are with the Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan, ROC.
    E-mail: ykchang@mail.ncku.edu.tw, alpha@mail.stut.edu.tw, raccoonz3@gmail.com, chucp@csie.ncku.edu.tw.
    Manuscript received 22 May 2006; revised 24 Oct. 2006; accepted 2 Feb. 2007; published online 23 Apr. 2007.
    Recommended for acceptance by T. Davis.
    For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number TPDS-0132-0506. Digital Object Identifier no. 10.1109/TPDS.2007.1103.

